Preface



These are the proceedings of CHES’99, the first workshop on Cryptographic 
Hardware and Embedded Systems. As it becomes more obvious that strong 
security will be an important part of the next generation of communication, 
computer, and electronic consumer devices, we felt that a new type of crypto- 
graphic conference was needed. Our goal was to create a forum which discusses 
innovative solutions for cryptography in practice. Consequently, the focus of the 
CHES Workshop is on all aspects of cryptographic hardware and embedded sys- 
tem design. Of special interest were contributions that describe new methods for 
efficient hardware implementations and high-speed software for embedded sys- 
tems, e.g., smart cards, microprocessors, or DSPs. We hope that the workshop 
will help to fill the gap between the cryptography research community and the 
application areas of cryptography. 

There were 42 submitted contributions to CHES’99, of which 27 were selected 
for presentation. All papers were reviewed. In addition, there were three invited 
speakers. 

We hope to continue to make the CHES Workshop a forum of intellectual 
exchange in creating the secure, reliable, and robust security solutions of tomor- 
row. We thank everyone whose involvement made the CHES Workshop such a 
successful event, and in particular we thank Murat Aydos, Dan Bailey, Brendon
Chetwynd, Adam Elbirt, Serdar Erdem, Jorge Guajardo, Linda Looft, Pam
O’Bryant, Marie Piergallini, Erkay Savaş, and Adam Woodbury.



Corvallis, Oregon 
Worcester, Massachusetts 
August 1999 



Çetin K. Koç
Christof Paar 
We Need Assurance



Brian D. Snow 
National Security Agency, USA 



Abstract. Today’s commercial cryptographic products have sufficient 
functionality, plenty of performance, but not enough assurance. Further, 
in the near term future, I see little chance of improvement in assurance, 
hence little improvement in true security offered by industry. The ma- 
licious environment in which security systems must function absolutely 
requires the use of strong assurance techniques. Most attacks today result 
from failures of assurance, not function. 

Am I depressed? Yes, I am. The scene I see is products and services 
sufficiently robust to counter many (but not all) of the “hacker” attacks
we hear so much about today, but not adequate against the more serious 
but real attacks mounted by economic adversaries and nation states. We 
will be in a truly dangerous stance: we will think we are secure (and act 
accordingly) when in fact we are not secure. 

Assurance techniques (barely) adequate for a benign environment 
simply will not hold up in a malicious environment. 

Despite the real need for additional research in assurance technology, 
we fail to fully use that which we already have in hand! We need to 
better use those assurance techniques we have, and continue research 
and development efforts to improve them and find others. 

Recall that assurances are confidence-building activities demonstrating
that system functions meet a desired set of properties and only those 
properties, that the functions are implemented correctly, and that the 
assurances hold up through manufacturing, delivery, and life-cycle of the 
system. 

Assurance is provided through structured design processes, documen- 
tation, and testing, with greater assurance coming through more exten- 
sive processes, documentation, and testing. All this leads to increased 
cost and delayed time-to-market - a severe one-two punch in today’s 
marketplace. 

I will briefly discuss assurance features appropriate in each of the following
five areas: operating systems, software modules, hardware features,
third party testing, and legal constraints. 

Each of us should leave today with a stronger commitment to quality 
research in assurance techniques with strong emphasis on transferring 
the technology to industry. It is not adequate to have the technique; it
must be used. We have our work cut out for us; let’s go do it. 



Ç.K. Koç and C. Paar (Eds.): CHES’99, LNCS 1717, p. 1, 1999.
© Springer-Verlag Berlin Heidelberg 1999




Factoring Large Numbers 
with the TWINKLE Device 
(Extended Abstract) 



Adi Shamir 

Computer Science Dept. 
The Weizmann Institute 
Rehovot 76100, Israel 
shamir@wisdom.weizmann.ac.il 



Abstract. The current record in factoring large RSA keys is the fac- 
torization of a 465 bit (140 digit) number achieved in February 1999 
by running the Number Field Sieve on hundreds of workstations for se- 
veral months. This paper describes a novel factoring apparatus which 
can accelerate known sieve-based factoring algorithms by several orders 
of magnitude. It is based on a very simple handheld optoelectronic device 
which can analyse 100,000,000 large integers, and determine in less than 
10 milliseconds which ones factor completely over a prime base consisting 
of the first 200,000 prime numbers. The proposed apparatus can increase 
the size of factorable numbers by 100 to 200 bits, and in particular can 
make 512 bit RSA keys (which protect 95% of today’s E-commerce on 
the Internet) very vulnerable. 



Keywords: Cryptanalysis, Factoring, Sieving, Quadratic Sieve, Number 
Field Sieve, optical computing. 



Ç.K. Koç and C. Paar (Eds.): CHES’99, LNCS 1717, pp. 2-12, 1999.
© Springer-Verlag Berlin Heidelberg 1999







1 Introduction 

The security of the RSA public key cryptosystem depends on the difficulty of 
factoring a large number n which is the product of two equal size primes p 
and q. This problem had been thoroughly investigated (especially over the last 
25 years), and the last two breakthroughs were the invention of the Quadratic 
Sieve (QS) algorithm [P] in the early 1980’s and the invention of the Number 
Field Sieve (NFS) algorithm [LLMP] in the early 1990’s. The asymptotic time 
complexity of the QS algorithm is $O(e^{\ln(n)^{1/2} \ln\ln(n)^{1/2}})$ and the asymptotic
time complexity of the NFS algorithm is $O(e^{1.93 \ln(n)^{1/3} \ln\ln(n)^{2/3}})$. For numbers
with up to about 350 bits the QS algorithm is faster due to its simplicity, but
for larger numbers the NFS algorithm is faster due to its better asymptotic 
complexity. 

The complexity of the NFS algorithm grows fairly slowly with the binary 
size of n. Denote the complexity of factoring a 465 bit number (which is the 
current record - see [R]) by X. Then the complexity of factoring numbers which 
are 100 bits longer is about 40X, the complexity of factoring numbers which 
are 150 bits longer is about 220X, and the complexity of factoring numbers 
which are 200 bits longer is about 1100X. Since the technique described in this
paper can increase the efficiency of the NFS algorithm by two to three orders of 
magnitude, we expect it to increase the size of factorable numbers by 100 to 200 
bits, or alternatively to make it possible to factor with a budget of one million 
dollars numbers which previously required hundreds of millions of dollars. The 
main practical significance of such an improvement is that it can make 512 bit 
numbers (which are the default setting of most Internet browsers in e-commerce 
applications, and the maximum size deemed exportable by the US government) 
easy to crack. 
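
The growth figures above can be sanity-checked with a short calculation. The sketch below assumes the heuristic NFS cost $e^{1.93 \ln(n)^{1/3} \ln\ln(n)^{2/3}}$ with the lower-order $o(1)$ term dropped; the function names `nfs_exponent` and `relative_cost` are illustrative only. Under these assumptions the ratios come out somewhat larger than the quoted 40X, 220X and 1100X, but of the same order of magnitude.

```python
import math

def nfs_exponent(bits, c=1.93):
    """Exponent of the heuristic NFS cost for a number of the given bit length.

    Assumption: the o(1) term of the asymptotic formula is ignored.
    """
    ln_n = bits * math.log(2)
    return c * ln_n ** (1 / 3) * math.log(ln_n) ** (2 / 3)

def relative_cost(extra_bits, base_bits=465):
    """Cost of factoring (base_bits + extra_bits) bits relative to base_bits."""
    return math.exp(nfs_exponent(base_bits + extra_bits) - nfs_exponent(base_bits))

for extra in (100, 150, 200):
    print(f"{extra} bits longer: about {relative_cost(extra):.0f}X")
```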

The new factoring technique is based on a novel optoelectronic device called 
TWINKLE.^1 Designing and constructing the first prototype of this device can
cost hundreds of thousands of dollars, but the manufacturing cost of each addi- 
tional device is about $5,000. It can be combined with any sieve-based factoring 
algorithm, and in particular it can be used in both the QS and the NFS algorithms.
It uses their basic mathematical structure and inherits their asymptotic
complexity, but improves the practical efficiency of their sieving stage by a large 
constant factor. Since this is the most time consuming part of these algorithms, 
we get a major improvement in their total running time. 

For the sake of simplicity, we describe in this extended abstract only the new 
implementation of the sieving stage in the simplest variant of the QS algorithm. 
Most of the new ideas apply equally well to improved variants of the QS algo- 
rithm and to the general NFS algorithm, but the details are more complicated, 
and will be described only in the full version of this paper. 



^1 TWINKLE stands for “The Weizmann INstitute Key Locating Engine”.
2 The QS Factoring Algorithm

Given the RSA number $n = pq$, the QS algorithm tries to construct two numbers
$y$ and $z$ such that $y \neq \pm z \pmod{n}$ and $y^2 = z^2 \pmod{n}$. Knowledge of such
a pair makes it easy to factor $n$ since $\gcd(y - z, n)$ is either $p$ or $q$. To find
such $y$ and $z$, we generate a large number of values $y_1, y_2, \ldots, y_m$, compute
each $y_i^2 \pmod{n}$, and try to factor it into a product of primes $p_j$ from a prime
base $B$ consisting of the $k$ smallest primes $p_1 = 2, p_2 = 3, \ldots, p_k$. Numbers
$y_i^2 \pmod{n}$ which have such factorizations into $\prod_{j=1}^{k} p_j^{e_j}$ are called smooth. If
the number of smooth modular squares found in such a way exceeds $k$, we can
use Gauss elimination to find a subset of the vectors $(e_1, e_2, \ldots, e_k)$ of the prime
multiplicities which is linearly dependent modulo 2. When the corresponding
$y_i^2 \pmod{n}$ and their factorizations are multiplied, we get an equation of the
form $\prod_{i=1}^{m} y_i^{2 b_i} = \prod_{j=1}^{k} p_j^{c_j} \pmod{n}$ where all the $b_i$’s (which define the subset)
are 0’s and 1’s and all the $c_j$’s (which are the sums of the prime multiplicities)
are even numbers. We can now get the desired equation $y^2 = z^2 \pmod{n}$ by
defining $y = \prod_{i=1}^{m} y_i^{b_i} \pmod{n}$ and $z = \prod_{j=1}^{k} p_j^{c_j/2} \pmod{n}$.
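
The final step, turning a congruence of squares into a factor, can be illustrated with a toy modulus. This is a minimal sketch: the helper name `factor_from_congruence` and the example modulus 8051 = 83 · 97 are chosen here for illustration and do not come from the paper.

```python
import math

def factor_from_congruence(y, z, n):
    """Return a nontrivial factor of n given y^2 = z^2 (mod n), or None.

    None is returned when y = +z or y = -z (mod n), which is exactly the
    case excluded by the condition that y differs from +-z modulo n.
    """
    if (y * y - z * z) % n != 0:
        raise ValueError("y^2 and z^2 are not congruent modulo n")
    g = math.gcd(y - z, n)
    return g if 1 < g < n else None

# 90^2 = 8100 = 49 = 7^2 (mod 8051), and 90 is not +-7 modulo 8051.
print(factor_from_congruence(90, 7, 8051))     # a useful pair
print(factor_from_congruence(8044, 7, 8051))   # 8044 = -7 (mod 8051): useless pair
```

The second call shows why the pair must satisfy $y \neq \pm z \pmod{n}$: otherwise the gcd is 1 or $n$ and reveals nothing.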

The key to the efficiency of the QS algorithm is the generation of many 
small modular squares whose smoothness is easy to test. Consider the simplest 
case in which we use the quadratic polynomial $f(x) = (a + x)^2 \pmod{n}$ where
$a = \lfloor \sqrt{n} \rfloor$, and choose $y_i = a + i$ for $i = 1, 2, \ldots, m$. Then it is easy to see that
for small $m$ the corresponding $y_i^2 = f(i) \pmod{n}$ are half size modular squares
which are much more likely to be smooth numbers than random modular squares. 

The simplest way of testing the smoothness of the values in such a sequence 
is to perform trial division of each value in the sequence by each prime in the 
basis. Since the $f(i)$’s are hundreds of bits long, this is very slow.

The QS algorithm expresses all the generated $f(1), \ldots, f(m)$ in the non-modular
form $f(i) = (a + i)^2 - n$ (since $m$ is small), and determines which of these
values are divisible by $p_j$ from the basis $B$ by solving the quadratic modular
equation $(a + i)^2 - n = 0 \pmod{p_j}$. This is easy, since the modulus $p_j$ is quite
small.^2

The quadratic equation mod $p_j$ will have either zero or two solutions $d_j'$ and
$d_j''$. In the first case we can deduce that none of the $f(i)$’s will be divisible by
$p_j$, and in the second case we can deduce that $f(i)$ will be divisible by $p_j$ if and
only if $i$ belongs to the union of the two arithmetic progressions $p_j \cdot r + d_j'$ and
$p_j \cdot r + d_j''$ for $r \geq 0$.

The smoothness test in the QS algorithm is based on an array $A$ of $m$ counters,
where the $i$-th entry is associated with $f(i)$. The sieving algorithm zeroes
all these counters, and then loops over the primes in the basis. For each prime $p_j$,
and for each one of its two arithmetic progressions (if they exist), the algorithm
scans the counter array, and adds the constant $\log_2(p_j)$ to all the counters $A(i)$
^2 We ignore the issue of the divisibility of $f(i)$ by higher powers of $p_j$, since except
for the smallest primes in the basis this is extremely unlikely, and we can explicitly 
add the powers of the first few primes to the basis without substantially increasing 
its size. 
whose indices $i$ belong to the arithmetic progression (there are about $m/p_j$ such
indices). At the end of this loop, the value of $A(i)$ describes the (approximate)
binary length of the largest divisor of $f(i)$ which factors completely over the
prime base $B$. The algorithm then scans the array, finds all the entries $i$ for
which $A(i)$ is close to the binary length of $f(i)$, tests that these $f(i)$’s are indeed
smooth by trial division, and uses them in order to factor $n$.
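
The counter-array sieve described above can be sketched in a few lines. This is an illustrative toy, not the optimized implementation the text refers to: the names `sieve_counters`, `smooth_indices` and `slack_bits` are hypothetical, the roots of $(a + i)^2 - n = 0 \pmod{p_j}$ are found by brute force rather than by a modular square-root algorithm, and higher prime powers are ignored exactly as in the text (so the threshold needs some slack).

```python
import math

def is_smooth(value, base):
    """Trial-division check that value factors completely over the base."""
    for p in base:
        while value % p == 0:
            value //= p
    return value == 1

def sieve_counters(n, m, base):
    """Counter array A: entry i accumulates log2(p) for each base prime p | f(i)."""
    a = math.isqrt(n)
    counters = [0.0] * (m + 1)            # entry 0 is unused; f(i) is for i = 1..m
    for p in base:
        # Brute-force roots; p = 2 or p | n may give one root instead of two.
        roots = [d for d in range(p) if ((a + d) ** 2 - n) % p == 0]
        for d in roots:
            first = d if d >= 1 else p    # smallest index >= 1 in the progression
            for i in range(first, m + 1, p):
                counters[i] += math.log2(p)
    return counters

def smooth_indices(n, m, base, slack_bits):
    """Indices whose counter is within slack_bits of log2 f(i), verified by trial division."""
    a = math.isqrt(n)
    counters = sieve_counters(n, m, base)
    found = []
    for i in range(1, m + 1):
        f_i = (a + i) ** 2 - n
        if counters[i] >= math.log2(f_i) - slack_bits and is_smooth(f_i, base):
            found.append(i)
    return found

print(smooth_indices(8051, 20, [2, 3, 5, 7, 11, 13], slack_bits=5.0))
```

On such a tiny example the slack has to be large relative to the size of $f(i)$, because repeated prime factors are not counted; for the hundreds-of-bits values discussed in the text the relative error is much smaller.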

Typical large scale factoring attacks with networks of PC’s may use m = 
100,000,000 and k = 200,000. The array requires 100 megabytes of RAM, and 
its counters can be accessed at the standard bus speed of 100 megahertz.^3 Just
scanning such a huge array requires about one second. Well optimized implemen- 
tations of the QS algorithm perform the sieving in 5 to 10 seconds, and find very 
few smooth numbers. They then choose a different quadratic polynomial $f'(x)$,
and repeat the sieving run (on the same machine, or on a different machine working
in parallel). This phase stops when a total of $k + 1$ smooth modular squares
are collected in all the sieving runs, and a single powerful computer performs 
the Gauss elimination algorithm and the actual factorization in a small fraction 
of the time which was devoted to the sieving. 

In the next section we describe the new TWINKLE device, which is an ul- 
trafast optical siever. It costs about the same as a powerful PC or a workstation, 
but can test the smoothness of 100,000,000 modular squares over a prime base 
of 200,000 primes in less than 0.01 seconds. This is 500 to 1000 times faster than 
the conventional sieving approach described above. 



3 The TWINKLE Device 

The TWINKLE device is a simple optoelectronic device which is housed in an 
opaque blackened cylinder whose diameter is about 6 inches and whose height 
is about 10 inches. The bottom of the cylinder consists of a large collection of 
LEDs (light emitting diodes) which twinkle at various frequencies, and the top 
of the cylinder contains a photodetector which measures the total amount of 
light emitted at any given moment by all the LEDs. The photodetector alerts a 
connected PC whenever this total exceeds a certain threshold. Such events are 
related to the detection of possibly smooth numbers, and their precise timing is 
the only output of the TWINKLE device. Since these events are extremely rare, 
the PC can leisurely translate the timing of each reported event to a candidate 
modular square, verify its smoothness via trial division, and use it in a conven- 
tional implementation of the QS or NFS algorithms in order to factor the input 
n. 

The standard PC implementation of the sieving technique assigns modular 
squares to array elements (using space) and loops over the primes (using time). 
The TWINKLE device assigns primes to LEDs (using space) and loops over the 

^3 Note that the faster cache memory is of little use, since the sieving process accesses
arithmetic progressions of addresses with huge jumps, which create continuous cache
misses.
modular squares (using time), which reverses their roles. This is schematically
described in Fig. 1. 



[Figure: two panels, CONVENTIONAL SIEVING and OPTICAL SIEVING, each with an axis labelled PRIMES.]

Fig. 1. Conventional vs. optical sieving: the boxed operations are carried out at the
same time slice



Each LED is associated with some period $p_j$ and delay $d_j$, and its only role is
to light up for one clock cycle at times described by the arithmetic progression
$p_j \cdot r + d_j$ for $r \geq 0$. To mimic the QS sieving procedure, we have to use nonuniform
LED intensities. In particular, we want the LED associated with prime $p_j$ to
generate light intensity proportional to $\log_2(p_j)$ whenever it flashes, so that the
total intensity measured by the photodetector at time $i$ will correspond to the
binary size of the largest smooth divisor of $f(i)$.^4 We can achieve this by
using an array of LEDs of different sizes or with different resistances. However,
a simpler and more elegant solution to the problem is to construct a uniform
array of identical LEDs, to assign similar sized primes to neighbouring LEDs,
and to cover the LED array with a transparent filter with smoothly changing
grey levels.^5 Note that the dynamic range of grey levels we have to use is quite
limited, since the ratio of the logs of the largest and the smallest primes in a 
typical basis does not exceed 24:1. 

To increase the sensitivity of the photodetector, we can place it behind a large 
lens which concentrates all the incoming light on its small surface area. The light

^4 Again, we ignore the issue of the divisibility of $f(i)$ by higher powers of the primes.
^5 For example, we can assign primes to LEDs in row major order and use a filter which
is dark grey at the top and completely transparent at the bottom, or assign primes
to LEDs in spiral order and use a filter which is darkest at its center.
intensity measurement is likely to be influenced by many sources of errors. For
example, the grey levels of the filter are only approximations of the logs, and
even uniformly designed LEDs may have actual intensities varying by 20% or
more. We can improve the accuracy of the TWINKLE device by measuring the
actual filtered intensity of each LED in the manufactured array, and assigning
the sequence of primes to the various LEDs based on their sorted list of measured 
intensities. However, the QS and NFS factoring algorithms are very forgiving to 
such measurement errors, and in PC implementations they use crude integer 
approximations to the logs in order to speed up the computation. There are two 
possible types of errors: missed events and false alarms. To minimize the number 
of missed events we can set a slightly lower reporting threshold, and to eliminate 
the resultant false alarms we can later test all the reported events on a PC, in 
order to find the extremely rare real events among the rare false alarms. For
typical values of the parameters, the average binary size of the smooth part of 
candidate values is about one tenth of their size, and only a tiny fraction of all 
candidate values have ratios exceeding one half. As a result, the desired events 
stand out very clearly as isolated narrow peaks which are about ten times higher 
than the background noise. 
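
The reversal of roles (primes in space, sieved values in time) can be simulated in software. The sketch below is only a model of the idea under simplifying assumptions: intensities are exact logarithms with no measurement noise, the reporting threshold is a fixed number of bits, and the names `led_table`, `light_at` and `reported_events` are hypothetical.

```python
import math

def led_table(n, base):
    """One (period, delay) pair per arithmetic progression, found by brute force."""
    a = math.isqrt(n)
    return [(p, d) for p in base for d in range(p)
            if ((a + d) ** 2 - n) % p == 0]

def light_at(leds, t):
    """Total intensity reaching the photodetector in time slice t."""
    return sum(math.log2(p) for p, d in leds if t % p == d)

def reported_events(leds, m, threshold):
    """Time slices 1..m in which the summed intensity reaches the threshold."""
    return [t for t in range(1, m + 1) if light_at(leds, t) >= threshold]

leds = led_table(8051, [2, 3, 5, 7, 11, 13])
print(reported_events(leds, 10, threshold=6.0))
```

No array of counters is kept: each time slice is examined once and then forgotten, which mirrors reason 3 in the list that follows. With the toy modulus 8051, the only time slice up to 10 that reaches 6 bits is $t = 10$, where $f(10) = 1750 = 2 \cdot 5^3 \cdot 7$.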

We claim that optical sieving is much better than conventional counter array 
sieving for the following reasons: 

1. We can perform optical sieving at an extremely fast clock rate. Typical si- 
licon RAM chips in standard PC’s operate at about 100 megahertz. LEDs, 
on the other hand, are manufactured with a much faster Gallium Arsenide 
(GaAs) technology, and can be clocked at rates exceeding 10 gigahertz without
difficulty. Commercially available LEDs and photodetectors are used to
send 10 gigabits per second along fiber optic cables, and GaAs devices are
widely used at similar clock rates as routers in high speed networks. 

2. We can instantaneously add hundreds of thousands of optical contributions, 
if we do not need perfect accuracy. Building a digital adder with 200,000 
inputs which computes their sum in a single clock cycle is completely unrea- 
listic. 

3. The optical technique does not need huge arrays of counters. Instead of using 
one memory cell per sieved value, we use one time slice per sieved value. Even 
with the declining cost of fast memories, time is cheaper than space. 

4. In the optical technique we do not have to scan the array at the beginning in
order to zero it, and do not have to scan the array again at the end in order
to find its high entries - both operations are done at no extra cost during
the actual sieving.

In the remaining sections we flesh out the design of each cell and the archi- 
tecture of the whole device. We based this design on many conversations with 
experienced GaAs chip designers, and used only commercially available technologies.
We may be off by a small factor in some of our size, speed, and cost
estimates, but we believe that the design is realistic, and that someone will try 
it out in the near future. 
4 Cell Design



The LED array is implemented on a single wafer of GaAs. Each cell on this 
wafer contains one LED plus some circuitry which makes it flash for exactly one 
clock cycle exactly every $p_j$ clock cycles with an initial delay of exactly $d_j$ clock
cycles. The high clock rate and extremely accurate timing requirements rule out 
analog control methods, and the unavoidable existence of bad cells in the wafer 
rules out a prewired assignment of primes to cells. Instead, we use identical cells 
throughout the wafer, and include in each cell two registers, A and B, which are 
loaded before the beginning of the sieving process with values corresponding to 
$p_j$ and $d_j$, respectively. For a typical sieving run over $m$ = 100,000,000 values,
we need only $\log_2(m) \approx 27$ bits in each one of these registers.

The structure of each cell (described in Fig. 2) is very simple. Instead of using
counters (with their more complicated designs and additional carry-induced delays),
we use register B as a maximal length shift register based on a single XOR
of two of its bits. It is driven by the clock, and runs until it enters the special
state in which all its bits are “1”. When this is recognized by the AND of all the
bits of register B, the LED flashes, and register B is reloaded with the contents of
register A (which remains unchanged throughout the computation). The initial
values loaded into registers A and B are not the binary representations of $p_j$ and
$d_j$, but the (easily computed) states of the shift register which are that many
steps before the special state of all “1”. That’s the whole cell design!
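
The cell behaviour can be modelled in software. The sketch below is an assumption-laden illustration rather than the circuit of Fig. 2: it uses a 5-bit register instead of 27 bits so that the full period (31) is easy to check, the feedback taps (bits 4 and 2) are one choice that gives a maximal-length sequence at this width, and the reload is assumed to occupy one clock cycle, so register A holds the state $p_j - 1$ steps before all ones. All names are hypothetical.

```python
WIDTH = 5                       # illustration only; the text uses 27-bit registers
MASK = (1 << WIDTH) - 1
ALL_ONES = MASK                 # the special state that triggers a flash
PERIOD = MASK                   # a maximal-length register cycles through 2^5 - 1 states

def step(state):
    """One clock of a shift register whose new bit is the XOR of two of its bits."""
    feedback = ((state >> 4) ^ (state >> 2)) & 1
    return ((state << 1) | feedback) & MASK

def state_before_all_ones(k):
    """State that reaches ALL_ONES after exactly k steps (0 <= k < PERIOD)."""
    state = ALL_ONES
    for _ in range((PERIOD - k) % PERIOD):
        state = step(state)
    return state

def flash_times(p, d, clocks):
    """Clock cycles in [0, clocks) at which a cell with period p and delay d flashes."""
    reg_a = state_before_all_ones(p - 1)   # reload consumes one clock (assumption)
    reg_b = state_before_all_ones(d)
    flashes = []
    for t in range(clocks):
        if reg_b == ALL_ONES:              # the AND of all bits of register B
            flashes.append(t)
            reg_b = reg_a                  # reload B from the unchanging A
        else:
            reg_b = step(reg_b)
    return flashes

print(flash_times(p=7, d=3, clocks=40))
```

The printed times form the arithmetic progression $7r + 3$, which is the behaviour the text asks of each cell. The model only supports periods up to 31; a real cell would need enough bits to cover the largest prime in the base.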




Fig. 2. A single cell in the array 









An important issue in such a high speed device is clock synchronization. 
Each clock cycle lasts only 100 picoseconds, and all the light pulses must be 
synchronized to within a fraction of this interval in order to correctly sum their 
contributions. Distributing electrical clock pulses (which travel slowly over long, 
high capacity wires) at 10 gigahertz to thousands of cells all over the wafer 
without skewing their arrival times by more than 10-20 picoseconds seems to 
be a very difficult problem. We solve it by using another optical trick. Since 
it is easy to construct in GaAs technology a small photodetector in each cell, 
we use optical rather than electrical clock distribution: a strong LED placed
opposite the wafer flashes at a fixed rate of 10 gigahertz, and its pulses
are almost simultaneously picked up by the photodetectors in all the cells, and
used to drive the shift registers in a synchronized way. Since light passes about
3 centimeters in 100 picoseconds, we just have to place the clocking LED and 
the summing photodetector sufficiently far away from the wafer to guarantee 
sufficiently similar optical delays to and from all the cells on the flat wafer. To 
avoid possible confusion between clock and data light pulses, we can use two 
different wavelengths for the two purposes. 

Computing the AND of 27 inputs requires a tree of depth 3 of 3-input AND 
gates, which may be the slowest cell operation. To speed it up, we can use 
a systolic design which carries out the tree computation in 3 consecutive clock 
cycles. This delays the detection of the special state by 3 clock cycles but keeps all 
the flashing LEDs perfectly synchronized. To compensate for the late reloading 
of register B, we simply store a modified value of $p_j$ in register A.

An improved cell design is based on the observation that about half the 
primes do not yield arithmetic progressions, whereas each prime in the other 
half yields two arithmetic progressions with the same period pj. In standard PC 
implementations this has little effect, since we still have to scan on average one 
arithmetic progression per prime in the basis. However, in the TWINKLE design 
the two cells assigned to the same pj can share the same A register (which never 
changes) to reload their separate B shift registers. In addition, the two cells can 
share the same LED and flash it with the OR of the two AND gates, since the 
two arithmetic progressions are always disjoint. We call such a combination a 
double cell, and use it to reduce the average number of registers per prime in 
the basis from 2 to 1.5. Since these registers occupy most of the area of the cell, 
this observation can increase the number of primes we can handle with a single 
wafer by almost 33%. 

5 Wafer Design 

We would like to justify our claim that a single wafer can handle a prime base of 
200,000 primes (which is the actual size used in recent PC-based factorizations). 
A standard 6 inch wafer has a total usable area of about $16 \cdot 10^9$ square microns.
Commercially available LED arrays (such as the arrays sold by Oki Semiconductors
to manufacturers of laser printers - see
http://www.oki.co.jp/OKI/home/English/New/OKI-News/1998/z9819e.html for further
details) have a linear density of 1200 LEDs per inch. At this density, each LED
occupies a 20μ × 20μ square with an area of 400μ², and we can fit about
40,000,000 LEDs on a single wafer. However, most of the area of each double cell
will be devoted to the three 27 bit registers. Crude conservative estimates
indicate that we can very comfortably fit each one of these 81 bits into an area
of 1,600μ² using commercially available GaAs technology. We can thus fit the whole
double cell into an area of less than 160,000μ², and pack 100,000 double cells
into a single wafer. Such a wafer will be able to sieve numbers over a prime base
of 200,000 primes.
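
The area budget above is simple arithmetic and can be rechecked directly. The sketch below only restates the figures from the text; the constant names are illustrative, and the factor of two between double cells and base primes reflects the earlier observation that about half the primes yield no arithmetic progressions.

```python
WAFER_AREA_UM2 = 16 * 10**9        # usable area of a 6 inch wafer (from the text)
LED_SIDE_UM = 20                   # side of one LED square at 1200 LEDs per inch
BIT_AREA_UM2 = 1_600               # conservative area per register bit
BITS_PER_DOUBLE_CELL = 3 * 27      # one shared A register plus two B registers
DOUBLE_CELL_AREA_UM2 = 160_000     # upper bound on the whole double cell

max_leds = WAFER_AREA_UM2 // LED_SIDE_UM**2
register_area = BITS_PER_DOUBLE_CELL * BIT_AREA_UM2
double_cells = WAFER_AREA_UM2 // DOUBLE_CELL_AREA_UM2
base_primes = 2 * double_cells     # roughly half the base primes need no cell

print(max_leds, register_area, double_cells, base_primes)
```

The registers take 129,600μ² of the 160,000μ² allotted to a double cell, which leaves room for the LED, the gates and the photodetector.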

A simple reality check is based on the computation of the total amount 
of memory on the wafer. The 100,000 double cells contain 81 x 100,000 bits, 
or about one megabyte of memory. The other gates (XOR, AND) and diodes 
(LEDs, photodetectors) occupy a small additional area. This is a very modest 
goal for wafer scale designs. 

The cost of manufacturing silicon wafers in a commercial FAB is about $1,500
per wafer, and the cost of manufacturing the more expensive GaAs wafers is 
about $5,000 per wafer (excluding design costs and assuming a reasonably large 
order of wafers). This is comparable to the cost of a strong workstation, but 
provides a sieving efficiency which is several hundred times higher. 

The TWINKLE device does not have a yield problem, which plagues many 
other wafer-scale designs: During the sieving process each cell works completely 
independently, without receiving any inputs or sending any outputs to neighbou- 
ring cells. Even if 20% of the cells are found to be defective in postproduction 
inspection, we can use the remaining 80% of the cells. If necessary, we can place 
two or more wafers at the same distance opposite the same summing detector, 
in order to compensate for defective cells or to sieve over larger prime bases. 

After determining the number of cells, we can consider the issue (which was 
ignored so far) of loading registers A and B in each cell with some precomputed 
data from a connected storage device. Silicon memory cannot operate at 10 giga- 
hertz, and thus we have to slow down the clocking LEDs facing the GaAs wafer 
during the loading phase. The A registers which contain the primes assigned to 
each LED can be loaded only once after each powerup, but the B registers which 
contain the initial delays have to be loaded for each sieving run. The total size 
of the 200,000 B registers is about 675 kilobytes. Such a small amount of data 
can be kept in a standard type of silicon memory, and transferred to the wafer in
0.002 seconds on a 27 bit bus operating at 100 megahertz. This is one fifth the 
time required to carry out the actual sieving at the full 10 gigahertz clock rate, 
and thus it does not create a new speed bottleneck. 
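
The timing figures in this paragraph follow from the clock rates. A minimal check, using only numbers stated in the text (the variable names are illustrative):

```python
M = 100_000_000                  # sieved values per run
OPTICAL_CLOCK_HZ = 10 * 10**9    # sieving clock, one value per cycle
BUS_CLOCK_HZ = 100 * 10**6       # loading bus, one 27 bit word per cycle
B_REGISTERS = 200_000
REGISTER_BITS = 27

sieve_seconds = M / OPTICAL_CLOCK_HZ
load_seconds = B_REGISTERS / BUS_CLOCK_HZ
b_register_bytes = B_REGISTERS * REGISTER_BITS // 8

print(sieve_seconds, load_seconds, b_register_bytes)
print("load/sieve ratio:", load_seconds / sieve_seconds)
print("speedup over a 5 to 10 second PC run:",
      5 / sieve_seconds, "to", 10 / sieve_seconds)
```

The last line reproduces the 500 to 1000 times speedup claimed earlier for the sieving step alone, before the loading time is added.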

The proposed wafer design has just 31 external connections: Two for power, 
two for control, and 27 for the input bus. The four modes of operation indu- 
ced by the two control wires consist of a test mode (in which the various LEDs 
are sequentially flashed to test their functionality and measure their light inten- 
sity), LOAD-A mode (in which the various A registers are sequentially loaded 
from the bus), LOAD-B mode (in which the various B registers are sequenti- 
ally loaded from the bus), and sieving mode (in which all the shift registers 
are simultaneously clocked at 10 gigahertz). We can briefly freeze the optical 
clocking during mode changes in order to enable the slow electric control signals
to propagate to all the cells on the wafer before we start operating in the new 
mode. 

Another important factor in the wafer design is its total power consumption. 
Strong LEDs consume considerable amounts of power, and if a large number of 
LEDs on the wafer flash simultaneously, the excessive power consumption can 
skew the intensity of the flashes. However, each tested number can be divisible 
by at most several hundred primes from the basis, and thus we have a small
upper bound on the total power which can be consumed by all the LEDs at any 
given moment in the sieving process. 

6 The Geometry of the TWINKLE Device 

The TWINKLE device is housed in an opaque cylinder with the wafer at the 
bottom and the summing photodetector and clocking LED at the top. Its dia- 
meter is determined by the size of the wafer, which is about 6 inches. Its height 
is determined by the uniformity requirements of the length of the various optical 
paths. 

To determine this height, we recall that light travels about 3 centimeters in a 
single clock cycle which lasts 100 picoseconds. To make sure that all the received 
light pulses are synchronized to within 15% of this duration, we want the length 
of the optical paths from the clocking LED to any point in the wafer and from 
there to the summing photodetector to vary by at most 0.5 centimeter. The 
simplest arrangement places both elements at the center of the top face of the 
cylinder, but this doubly penalizes LEDs located at the rim compared to LEDs
located at the center, and requires a cylinder whose length is about 110 centi- 
meters. A better arrangement uses several clocking LEDs placed symmetrically 
around the rim of the top face, and a single photodetector at the center of this 
face. A simple geometric calculation shows that the required uniformity will be 
attained in a cylinder which is just 25 centimeters (10 inches) long. 
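
The height for the simplest arrangement can be estimated from the path-length condition. The sketch below models only the case where both the clocking LED and the photodetector sit at the center of the top face, and it assumes the active area extends over the full 6 inch diameter (radius 7.62 cm); the rim arrangement is not modelled. The function names are hypothetical. Under these assumptions the height comes out slightly above the quoted 110 centimeters, and a marginally smaller active radius brings it to about that value.

```python
import math

WAFER_RADIUS_CM = 7.62       # assumption: cells reach the edge of a 6 inch wafer
MAX_PATH_SPREAD_CM = 0.5     # allowed variation of the round-trip optical path

def round_trip_cm(height, r):
    """Top center -> wafer point at radius r -> top center."""
    return 2 * math.hypot(height, r)

def min_height_central(radius, spread):
    """Smallest height h with round_trip(h, radius) - round_trip(h, 0) <= spread.

    From 2 * (sqrt(h^2 + R^2) - h) = spread it follows that
    h = (R^2 - (spread / 2)^2) / spread.
    """
    half = spread / 2
    return (radius**2 - half**2) / (2 * half)

h = min_height_central(WAFER_RADIUS_CM, MAX_PATH_SPREAD_CM)
print(f"central arrangement needs a height of about {h:.0f} cm")
```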

7 Concluding Remarks 

The idea of using physical devices in number theoretic computations is not new. 
D. H. Lehmer managed to factor (relatively small) numbers and solve other 
diophantine equations by pedalling on a device based on toothed wheels and 
bicycle chains of various lengths (a replica of this ingenious contraption from the 
1920’s is located at the Boston Computer Museum). His device even included 
a photodetector to alert the rider when the solution was found, but its mode 
of operation was of course completely different from our implementation of the 
quadratic sieve. 

The TWINKLE device proposed in this paper demonstrates the incredible 
speed and almost unbounded parallelism which is offered by today’s optoelectronic
techniques. We believe that they will find many additional applications
in cryptography and cryptanalysis. 
Abstract. The Cryptographic Challenges sponsored by RSA Laboratories
have given some members of the computing community an opportunity
to participate in some of the intrigue involved with solving secret
messages. This paper describes an effort to build DES-cracking hardware 
on a field-programmable system called the Transmogrifier 2a. A fully im- 
plemented system will be able to search the entire key space in 1040 days 
at a rate of 800 million keys/second. 



1 Introduction 

The RSA Cryptographic Challenges sponsored by RSA Laboratories [1] have 
provided some interesting opportunities for those in the computing area to be- 
come involved in the mystery and intrigue of discovering secret messages. One 
of the challenges was to break a straightforward version of the Data Encryption 
Standard, more commonly known as DES [2]. The brute-force approach is to
search the entire key space consisting of 2^56 or about 7.2 x 10^16 keys.

This paper describes a project to implement a DES cracking system in a 
general-purpose programmable hardware system called the Transmogrifier 2a 
(TM-2a) [3,4]. The TM-2a is a unique system of field-programmable gate arrays 
being developed at the University of Toronto that is intended for doing prototy- 
ping of hardware. A brief description of the TM-2a will be given in Section 2. 

The remainder of this section gives a brief overview of DES and reviews other
attempts at cracking it. Section 3 describes our implementation of DES on the
TM-2a. We conclude and outline future work in Section 4.



1.1 Overview of DES 

The simplest form of the DES algorithm takes a 56-bit encryption key and uses 
it to encode a 64-bit block of plain text data into a 64-bit block of output cipher 
text. Between an initial and final permutation, there are 16 essentially identical 
stages. In the first stage, one half of the data as well as the key goes through a 
function, F, and the result is exclusive-ored with the other half. For each
successive stage, the same thing happens with the halves reversed. Figure 1 shows
the data flow. Function F is shown in Fig. 2. The data and the key first go
through the expander and reducer that do simple selection and/or replication of 
the input bits to generate two 48-bit words. These two words are then exclusive- 
ored to form a single 48-bit word, which then goes through a table lookup called 
the S-box substitution. The S-box substitution is shown in Fig. 3. It consists of 
eight 6-bit input, 4-bit output lookup tables. The lookup tables are predetermined
functions that, along with the permutations, do most of the coding of
the data. The same algorithm is used for decoding, with the per-stage keys applied
in the reverse order. Hence, if we run the output through the circuit again with
the key order reversed, we should get back what we started with.
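
The property that one circuit serves for both coding and decoding can be illustrated with a small software sketch. The following Python is a toy model only: the round function `toy_f`, the 32-bit block size and the four subkeys are hypothetical stand-ins and are not the DES function F or the DES key schedule. It shows the stage structure described above (one half passes through a function with the key and is exclusive-ored into the other half), and that running the same routine with the subkeys reversed recovers the input.

```python
def toy_f(half, key):
    """Hypothetical 16-bit round function; NOT the DES function F."""
    return ((half * 0x9E37 + key) ^ (half >> 3)) & 0xFFFF


def feistel(block, subkeys, f=toy_f, half_bits=16):
    """Run a Feistel network over a 2*half_bits block, one stage per subkey."""
    mask = (1 << half_bits) - 1
    left, right = block >> half_bits, block & mask
    for k in subkeys:
        # One stage: the right half and the key go through f, the result is
        # exclusive-ored with the left half, and the halves swap roles.
        left, right = right, left ^ f(right, k)
    # Undo the last swap so that the same routine can be used to decode.
    return (right << half_bits) | left


subkeys = [3, 141, 59, 26]                      # arbitrary toy subkeys
cipher = feistel(0x12345678, subkeys)
plain = feistel(cipher, subkeys[::-1])          # same routine, keys reversed
```

Note that invertibility does not depend on `toy_f` being invertible; it follows from the exclusive-or structure alone, which is why the F function can be built from arbitrary lookup tables.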

Fig. 1. The basic DES pipeline.



1.2 Other Attempts 

The DES standard has long been criticized as being susceptible to an exhaustive 
key search and there has been much discussion and many recent attempts to 
show that it is weak. 

One of the earliest analyses of a practical machine for doing this was done 
by Wiener [5] in 1993. In his appendix, there is a very detailed gate-level design 
of a chip that could be implemented in a CMOS technology. He estimates that
the chip can test keys at a rate of 50 million keys per second. This chip can be 
used as the basis of a machine that can reduce the search time down to hours or 
minutes depending on the available budget. A review of numerous other designs 
was also given by Wiener. 

Recently, the evolution of the world-wide web has made it possible to network
together thousands of computers, ranging from low-cost personal computers to 
high-end workstations, all working on portions of the key space [6,7]. This was 
how the first RSA DES Challenge was solved in about 4 months [6]. 

A real hardware system, called Deep Crack, was constructed by the Electronic
Frontier Foundation (EFF) for under $250,000 and it was able to win the second 
RSA DES Challenge in 56 hours [8,9]. 

A world-wide web group, hosted by Distributed.Net [7], and EFF combined
their technologies to solve the final DES Challenge in a record 22 hours and 15 
minutes [10]. 

The use of FPGAs as a means of building hardware to crack cryptosystems 
has been suggested by many in the past and we only cite a few here [11,12]. 
FPGAs are an obvious technology because of the relatively low cost. Although 
our system of FPGAs will not come close to meeting the speeds of the EFF 
Deep Crack or Distributed.Net systems, we present it here as another data point
showing what can be done with some programmable hardware, which puts it
somewhere between an application-specific hardware approach and a large
network of general-purpose computers.

We first describe our hardware system. 

2 The Transmogrifier 2a 

The Transmogrifier 2a (TM-2a) [3,4] is a second-generation field-programmable 
system that is constructed with field-programmable gate arrays (FPGAs). The 
TM-2a is a flexible rapid-prototyping system that offers high capacity and high 
clock rates. It is intended to be flexible enough to implement a wide variety of sy- 
stems. A simple way to visualize the TM-2a is to think of building a large FPGA 
using existing FPGAs and field-programmable interconnect chips (FPICs).

Figure 4 shows the resources available on one TM-2a board. There are two
Altera 10K100 [13] logic devices and four I-Cube IQ320 [14] FPICs. Attached to
each FPGA is up to 4MB of RAM. The FPICs provide a programmable routing
network that can be used to connect the FPGAs on the board to each other 
and to FPGAs on other boards. Each board also has a low-skew, programmable 
clock generator and a housekeeping FPGA that is used to monitor the system 
and communicate over a bus to the host system. When the host is on a network, 
then the TM-2a can be programmed and run remotely. 

There can be up to 16 boards in a system. Assuming that each FPGA can 
hold a circuit of about 60K gates, the size of a 16-board system is about 2-million 
programmable gates. 

The TM-2a is being used at the University of Toronto to prototype a number 
of hardware concepts such as 3-dimensional procedural texture mapping, head 
tracking, and image processing. When the RSA DES Challenge was announced, 
the TM-2a seemed like an obvious system for building a DES cracker. 

3 DES on the TM-2a 

In this section we describe the implementation of our DES cracking system on 
the TM-2a hardware. We first give a small primer on the Altera 10K series
logic devices architecture and the design methodology that we use. Some of the 
history behind the development of the hardware is given before we describe the 
final implementation. We end with a summary of our results. 

3.1 The Altera 10K Logic Device

The main building block of the Altera 10K logic device is called a Logic Element
or LE. Each LE has a number of resources of which the important ones for us
were the 4-input, 1-output look-up table (4-LUT), the cascade chain, and the
programmable register. The LEs are grouped in blocks of eight called LABs with
local routing within the LABs. The LABs are arranged in the chip as a matrix
with another routing structure connecting the LABs. A 10K100 has 52 columns
and 12 rows of LABs, for a total of 52 x 12 x 8 = 4992 LEs.

There are several ways to describe circuits that will be programmed in the
device. These include schematics and various hardware description languages 
(HDLs). We chose to use AHDL, which is Altera’s proprietary HDL, instead of 
a language such as VHDL. With AHDL, it is easier to control the logic mapping 
and therefore get more efficient and faster designs than with a more generic 
language. The actual synthesis and place and route is done using Altera’s design 
system called Max+Plus II. 

3.2 Early DES Work on the TM-2 

Based on the work of Wiener [5] we understood that the goal was to build a 
pipeline capable of having a throughput of one key crack per cycle. Our first 
attempt [15] was based on an earlier version of the hardware, called the TM-2. 
The TM-2 was built at a time when the largest available FPGA was the Altera 
10K50, which has roughly half the capacity of the 10K100 used in the TM-2a.
Our TM-2 system has two boards, and four 10K50 FPGAs. On this system it 
was only possible to build half of the DES pipeline in a single 10K50. Therefore, 
we could only build two complete pipelines on the original TM-2 system. At that 
time the TM-2 only ran at 6.25 MHz, which was the limiting factor. This meant 
that we could crack keys at the rate of 12.5 million keys per second taking about 
183 years to search the space. 

Further analysis [16] of the work by Bernier showed that there were two areas 
that would limit the performance of the circuit. One was in the S-box circuitry 
and the other was in the interface circuitry that was used to communicate with 
the host. The interface could be easily decoupled from the rest of the circuit 
while the S-box needed more thought. A more serious problem we discovered 
was that the 10K100 did not really have double the logic of the 10K50 despite
what the part numbers might suggest! The reason has to do with how the FPGA
manufacturer counts its gates. This meant that we could not simply combine our
two 10K50 circuits to form a single DES pipeline in one 10K100. More analysis
and optimization of the circuit area was required. 



3.3 The TM-2a DES Implementation 

The goal of the TM-2a implementation was to implement a complete DES pipeline
in a single 10K100, make it run as fast as possible, and then replicate it so
that we could have 32 pipelines running in parallel. By doing this, we would not 
be limited by the interconnect network and crossing chip boundaries. It would 
also be much easier to replicate the pipelines across the system. 

The top-level organization in a single chip is shown in Fig. 5. The key maker,
which is some sort of counter, is used to produce keys. The plain text is coded 
with each key and then compared with the given cipher text. The circuit stops 
when a match is found. 

There are enough resources to build all 16 stages as a large combinational 
circuit, but clearly this would be very slow. The next obvious step is to pipeline 
the logic by separating each stage with pipeline registers as shown in Fig. 6.

Fig. 5. Top-level structure of a chip.



The problem with this design is that there are not enough resources. As the 
computation proceeds down the pipeline, it is necessary to also keep the key for 
that stage in a register meaning that 16 keys will have to be stored. This uses 
almost 20% of the available LEs in the 10K100. We needed to find a key maker
that would remember the 16 most recent key values without using so many 
registers. The next step would be to try and make the S-box logic go faster. 




Fig. 6. Simple DES pipeline. 



The Key Maker. Our solution to the resource problem was to use a Linear-Feedback
Shift Register (LFSR). By choosing the feedback taps correctly it is
possible to generate each key exactly once, apart from the single lock-up state
(all zeros for exclusive-or feedback), which a maximal-length LFSR never reaches
and which must be tested separately. To remember previous keys, it is only
necessary to extend the shift register by 15 registers as shown in Fig. 7. As each
key is generated, the older ones can be found by sampling the values at a shifted 
offset from the new key. A possible disadvantage of this is routing the extended 
bits to the rest of the pipeline, but this ended up not being a factor. 
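
The extended shift register can be modelled in a few lines of software. The sketch below is an illustration under stated assumptions, not the authors' circuit: it uses a 5-bit toy LFSR whose taps correspond to the primitive polynomial x^5 + x^2 + 1 (the taps of the real key maker are not given in the text), and the function names are hypothetical. It extends the register by 15 bits and shows that the key generated k clocks earlier can be read at an offset of k positions, which is what each pipeline stage needs.

```python
def run_lfsr(n, taps, extra, seed, steps):
    """Clock an n-bit Fibonacci LFSR whose shift register is `extra` bits longer.

    bits[0] holds the newest bit. Returns the final register contents and the
    list of n-bit states produced at each clock.
    """
    bits = list(seed) + [0] * extra
    states = []
    for _ in range(steps):
        feedback = 0
        for t in taps:
            feedback ^= bits[t]
        bits = [feedback] + bits[:-1]      # shift every bit by one position
        states.append(tuple(bits[:n]))
    return bits, states


def key_for_stage(bits, n, stage):
    """Key generated `stage` clocks ago, sampled at a shifted offset."""
    return tuple(bits[stage:stage + n])


N, EXTRA = 5, 15                           # toy width; 15 extra bits as in the text
bits, states = run_lfsr(N, taps=(2, 4), extra=EXTRA,
                        seed=(1, 0, 0, 0, 0), steps=40)
```

Only `EXTRA` additional register bits are needed to expose 16 consecutive keys, compared with 16 full key registers in the simple pipeline.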




Fig. 7. LFSR with extension to save old keys.



Other advantages result from using an LFSR. An LFSR is much faster and 
simpler than binary counters, although in our case the key generation was not the 
critical path. Also, it is straightforward to serially preload the counter without 
additional logic and extra pins. 

A slight disadvantage of the LFSR is that it does not count in a linear
sequence. This means that we have to be a bit more careful when dividing the 
key space across the chips. The simple solution is to fix the key space for each 
chip by pre-setting the high order five bits when we are using 32 chips. We then 
build an LFSR that is only 51 bits long instead of 56 bits long. 
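
The partitioning described above amounts to concatenating a fixed 5-bit chip number with the 51-bit LFSR state. A minimal sketch follows, assuming the chip number occupies the high-order bits as stated; the function and constant names are hypothetical.

```python
KEY_BITS = 56
CHIP_BITS = 5                          # 2**5 = 32 chips
LFSR_BITS = KEY_BITS - CHIP_BITS       # 51 bits generated by the LFSR


def full_key(chip_id, lfsr_state):
    """Combine the preset high-order chip bits with the LFSR state."""
    if not 0 <= chip_id < (1 << CHIP_BITS):
        raise ValueError("chip_id out of range")
    if not 0 <= lfsr_state < (1 << LFSR_BITS):
        raise ValueError("lfsr_state out of range")
    return (chip_id << LFSR_BITS) | lfsr_state
```

Because each chip owns a distinct prefix, the 32 sub-ranges are disjoint regardless of the order in which each LFSR visits its own 51-bit states.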



Pipelining Possibilities Based on our previous work we knew that the S-box 
was the important critical path. Figure 8 shows one stage of the basic 16-stage 
pipeline and more details of how one bit of the S-box is constructed. 

An S-box is a 6-input, 4-output lookup table, which can be thought of as four 
6-input, 1-output lookup tables (6-LUT). The Altera device only has 4-LUTs 
so we had to find an efficient way to build the 6-LUTs. The straightforward 
solution is to have four 4-LUTs and a 4:1 multiplexer. The 4:1 multiplexer can 
be implemented as two levels of 2:1 multiplexers, which means that three levels 
of 4-LUTs are needed. A solution that uses only two levels and one fewer 4-LUT 
is shown in Fig. 8. This takes advantage of the AND gate that is available as part 
of the cascade chain in the LEs. An extra inversion is necessary at the output 
of the modified S-box but this can be absorbed transparently in the next level 
of logic. 
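
The straightforward decomposition mentioned above (four 4-LUTs feeding a 4:1 multiplexer) can be sketched in software. This is not the optimized three-LUT cascade-chain structure of Fig. 8, whose details are not recoverable from the text. The table is the DES S-box S1 as given in the standard; the function names are hypothetical. In DES the two outer input bits select the row and the four middle bits select the column, so each row of the table is naturally one 4-LUT per output bit.

```python
S1 = [
    [14, 4, 13, 1, 2, 15, 11, 8, 3, 10, 6, 12, 5, 9, 0, 7],
    [0, 15, 7, 4, 14, 2, 13, 1, 10, 6, 12, 11, 9, 5, 3, 8],
    [4, 1, 14, 8, 13, 6, 2, 11, 15, 12, 9, 7, 3, 10, 5, 0],
    [15, 12, 8, 2, 4, 9, 1, 7, 5, 11, 3, 14, 10, 0, 6, 13],
]


def make_4luts(out_bit):
    """Four 16-entry truth tables (one per row) for one output bit of S1."""
    return [[(row[col] >> out_bit) & 1 for col in range(16)] for row in S1]


def mux4(sel, d):
    """4:1 multiplexer built from two levels of 2:1 multiplexers."""
    lo = d[1] if sel & 1 else d[0]
    hi = d[3] if sel & 1 else d[2]
    return hi if sel & 2 else lo


def sbox_bit(x, out_bit):
    """One output bit of S1 as a 6-input function of x (6-bit integer)."""
    row = (((x >> 5) & 1) << 1) | (x & 1)      # outer two bits
    col = (x >> 1) & 0xF                       # middle four bits
    luts = make_4luts(out_bit)
    return mux4(row, [lut[col] for lut in luts])


def sbox(x):
    """Full 6-input, 4-output S1 assembled from four 6-LUTs."""
    return sum(sbox_bit(x, b) << b for b in range(4))
```

The two multiplexer levels plus the LUT level correspond to the three levels of logic that the modified structure in the paper reduces to two.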

The modified S-box structure can be easily pipelined, almost for free because
the output of each 4-LUT can be latched at no extra cost. Only two additional
registers are needed to pipeline the 2-bit bus that is connected to the inputs of
the second level of LUTs. However, the true cost of an additional pipeline stage
must consider the context of the S-box in the full pipeline.

Fig. 8. S-Box details and pipelining options.

The simple pipeline puts a register between each of the 16 stages. If we wish 
to add an additional pipeline stage, then there are three possibilities as shown 
in Fig. 8. The dashed line at the top shows the existing 64-bit pipeline register. 
The labels on the dashed lines show the number of bits that have to be registered
if pipelining were done at that level. We do not have enough resources to add 
registers at all of the levels. The most economic place is at the level that goes 
through the S-box because many of the register bits come for free as mentioned 
above. However, it is still necessary to register the 64 other bits at that level 
that do not go through the S-box. Unfortunately, adding this extra pipeline stage 
exceeded the resources available to us so we are left with the original simple 
pipeline. 



The Complete System. The full system will consist of 32 complete DES pipelines,
each running in one of the 10K100s on the TM-2a. Software running on a
host machine will communicate with the hardware to monitor the status of each 
chip. In addition there is a separate daemon program that monitors the status 
of the TM-2a. Since the TM-2a is available to everyone on our network, it is 
essentially a common resource. Users make calls to the monitor to gain access 
to the machine and to load their circuits. The actual utilization of the TM-2a 
for other projects is low so we have modified the monitor to determine when the 
TM-2a is idle. During the idle periods, the DES cracker can be loaded and run. 
When the TM-2a is needed, then the current state, which is just the current 
key in the LFSR, is saved so that it can be restored the next time the circuit is
loaded.

3.4 Results and Status

Our final design uses 4300 of the 4992 available LEs, which is about 86% of the 
resources. Adding an extra level of pipelining in the S-box was just 4% larger 
than what could fit in our FPGAs. It would have easily worked if we had 10K130
devices. 

The maximum clock speed reported by Max+Plus II is 25MHz. Since we 
are able to process one key per clock cycle, this gives us 25M keys/second per 
chip. By using all 32 chips available on TM-2a, a total throughput of 800M 
keys/second is achieved. To search through the whole key space of 2^56 keys, it
would take 90.1 million seconds, or 25 thousand hours, which is about 1040 days. 
While this is clearly not fast enough for practical use, it represents a tremendous 
speed increase compared to what conventional computers can do within the same 
volume of space. If we could have improved the pipeline with one extra stage 
in the S-box, the speed would have been over 40 MHz giving around 650 days to
search the key space. 
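
The search-time figures quoted in this paper follow directly from the key-space size and the aggregate key rate. The short calculation below reproduces them; it assumes one key per clock cycle per pipeline, as stated above, and the function name is hypothetical.

```python
KEYSPACE = 2 ** 56
SECONDS_PER_DAY = 86_400


def search_days(pipelines, clock_hz, keyspace=KEYSPACE):
    """Days needed to sweep the whole key space at one key per clock per pipeline."""
    keys_per_second = pipelines * clock_hz
    return keyspace / keys_per_second / SECONDS_PER_DAY


tm2a_days = search_days(32, 25e6)              # 32 chips at 25 MHz
tm2a_fast_days = search_days(32, 40e6)         # hypothetical 40 MHz pipeline
tm2_years = search_days(2, 6.25e6) / 365.25    # original TM-2, two pipelines
```

These evaluate to roughly 1040 days, 650 days and 183 years respectively, matching the figures in the text. On average a key is found after searching half the space, so expected times are half of these.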

Since much of the structure of the circuit is reasonably regular and the data 
flows in one direction, we would have liked the option of hand-placing the logic 
to reduce routing delays. Evidence from other work using other devices shows 
that amazing speeds can sometimes be obtained, such as a 250 MHz cross- 
correlator [17]. Hand-placement was not an option with our devices. We do not 
know how much difference this would have made, given the hierarchical routing 
structure of the Altera 10K devices but it would have been nice to try. We feel
that with a different FPGA architecture, we could have more easily optimized
the design for speed. 

The TM-2a is estimated to cost about US$3300 per board and about 
US$60,000 for the 16-board system using prices from the fall of 1998^. If the 
desire is to always be using the current state-of-the-art FPGA then the above
numbers are probably a good estimate for a starting point. 

However, this is much more than would be needed for a dedicated system of 
32 chips. A single-board system with 32 chips using similar technology to ours is 
estimated to be less than half the cost of a 16-board TM-2a system. The TM-2a 
is also using technology that is about 2 years old. When we revised the TM-2 
design to use the 10K100s, we could have used larger and faster parts but this
would have caused too much change to our design, given our desire to make the 
revision quickly. We would have had to redo our routing network because there 
would have been more pins, and the faster parts run at lower voltages, meaning 
our board design would have had to change too much. 

It is clear that as the density and speed of FPGAs continues to improve, it 
will become easier and easier to build a small fast machine to crack DES. 

We have successfully run the system on a two-board (four-FPGA) version of
the TM-2a. At this point in time, summer of 1999, our 16-board system is being
commissioned. Our DES cracking circuit is the first application to run on it that
uses all of the boards.

^ Our numbers are very approximate because we have always been fortunate that
Altera was willing to donate the devices that we needed so we have been sheltered
from a lot of the true costs.



4 Conclusions and Future Work 

In this paper, we have described the implementation of a DES cracking system 
on a general-purpose field-programmable hardware system. The goal was to de- 
monstrate the capabilities of field-programmable hardware and, in particular, 
the capabilities of our particular TM-2a field-programmable system. Although 
the system cannot compete with those that actually were able to solve the DES 
Challenges, our implementation does show how close technology is to being able 
to build machines capable of cracking DES without the aid of special-purpose cu- 
stom hardware or the organizational requirements of coordinating a large number 
of computers on a network. This technology is available to everyone. 

A 16-board TM-2a system can achieve a throughput of 800 million keys 
per second, which is still about 300 times slower than the last DES Challenge 
winner that was a combination of the EFF Deep Crack custom hardware and
Distributed.Net’s roughly 100,000 computers. They were testing 245 billion keys
per second when the key was found [9]. When compared to just the Deep Crack 
hardware, which can test over 88 billion keys per second, the TM-2a is about 
110 times slower. Based on our estimate of about US$30K for a dedicated 32-chip
system, spending the same amount as EFF did would give us 8 times more
performance, so that the FPGA system would only be about 14 times slower.
By using a tool that allows more manual placement and routing and a similar
generation of technology to Deep Crack, it is possible we could find another factor
of 2 to 3 in performance. The difference between programmable and custom 
hardware then becomes even smaller. 
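
The ratios in the comparison above can be checked with a few lines of arithmetic. The sketch uses only figures quoted in this paper (800 million, 88 billion and 245 billion keys per second, about US$250,000 for Deep Crack and about US$30K for a dedicated 32-chip system); the variable names are hypothetical, and the cost scaling assumes performance grows linearly with the number of systems bought.

```python
tm2a_rate = 800e6            # keys/second, 16-board TM-2a
combined_rate = 245e9        # Deep Crack plus Distributed.Net
deep_crack_rate = 88e9       # Deep Crack alone
eff_budget = 250_000         # US$, approximate cost of Deep Crack
fpga_system_cost = 30_000    # US$, estimated dedicated 32-chip system

slowdown_vs_combined = combined_rate / tm2a_rate          # about 300
slowdown_vs_deep_crack = deep_crack_rate / tm2a_rate      # about 110
systems_for_budget = eff_budget // fpga_system_cost       # 8 whole systems
scaled_slowdown = slowdown_vs_deep_crack / systems_for_budget   # about 14
```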

With very few modifications, our DES cracker can be used as an ordinary 
high speed DES encoder/decoder. 

For our own research into FPGA architectures and systems, the DES cracker
circuit has given us a large benchmarking circuit. In future we plan to investigate 
more sophisticated ciphers such as RC5 [18]. 

Finally, it is clear that DES cracking hardware is quickly coming within
reach of many institutions because of the rapid improvement in FPGA technology.

Abstract. In this paper, the modelling of a crypto-processor in an FPGA chip
based on the Rapid Prototyping of Application Specific Signal Processors
(RASSP) design concept is described. Using this concept, the modelling is
carried out in a structured manner, from design capture in VHDL code to
design synthesis in an FPGA prototype. Through this process, the turnaround time
of the design cycle is reduced by more than 50% compared to a normal design cycle.
This paper also emphasises the crypto-processor architecture for space and
speed trade-offs; the design methodology for design insertion and modification; and
design automation from virtual prototyping to real hardware. Reductions of more
than 60% in area and 75% in timing are reported.



1 Introduction 



The design flow and the techniques for modelling a crypto-processor [1] in an FPGA chip
based on RASSP [2, 3, 4, 5] are described in this paper. The modelling makes use
of the VHDL platform, which provides a well-suited simulation and
synthesis medium for rapid prototyping. It also facilitates the design
methodology of RASSP, which promotes design upgrades and reuse. The
modelled crypto-processor is designed for use in embedded digital systems that
require area/speed/power trade-offs, as crypto-processors are now commonly used in
digital devices such as Electronic Fund Transfer (EFT) systems and
electronic wallets using smart cards.



This paper presents the procedure for modelling the crypto-processor, from design
to synthesis, in the following sections. Sections 1.1 and 1.2 introduce the background
of RASSP and of the modelled crypto-processors. Section 2 describes the design
process based on VHDL virtual prototyping, from the design
specification through the executable specification to the detailed design. Section 3
demonstrates the detailed design methodology of the crypto-processors. Section 4
reports the observations and results of this study, and Section 5 concludes.



Ç.K. Koç and C. Paar (Eds.): CHES’99, LNCS 1717, pp. 25-36, 1999.
© Springer-Verlag Berlin Heidelberg 1999

1.1 Scope of RASSP

Rapid Prototyping of Application Specific Signal Processors (RASSP) [2, 3, 4, 5] is a
modern methodology for designing embedded digital systems. It supports the
design of processors through a structured framework. The RASSP framework
mainly emphasises top-down design, design reuse and model-year design
concepts [11]. Implementing these design concepts results in a shorter
time-to-market and first-time success in silicon fabrication.

In this study, those concepts are demonstrated using VHDL modelling at
multiple levels of abstraction, with all component objects defined through a standard
open interface and a technology-independent specification. This not only
provides reusable architecture library components, but also supports the rapid
insertion of a new element into an existing design for upgrades or modifications.



1.2 Cryptography and Crypto-Processors 

Nowadays, cryptography is commonly used in the commercial and banking sectors, as
Electronic Commerce has created an urgent need for it in Electronic Fund Transfer (EFT)
applications. In this paper, the main focus is on the modelling of symmetric
crypto-processors, which encrypt fixed-length blocks of data. The Data Encryption
Standard (DES) [6] is often used as a basic building block in existing cryptosystems,
and different applications use it in different ways. However, attacks on DES
using linear cryptanalysis and differential cryptanalysis, as well as exhaustive search,
are also well known. Therefore, in order to strengthen the security of
existing cryptosystems, various modifications and upgrades of the DES
algorithm that use DES components as building blocks have been proposed. Hence,
modelling the DES algorithm in a RASSP design framework helps the rapid
prototyping of a new design, which benefits from the reuse of design information
and the functional block library of a previous design. Examples are the Randomised-DES
[7,8,9] proposed by T. Kaneko, K. Koyama and R. Terada and the Extended-DES
[10] proposed by H.S. Oh and S.J. Han. These are DES-based cryptosystems that
use DES components as building blocks.

Randomised-DES (RDES) [7,8,9] is a cryptosystem with an n-round DES in which a
probabilistic swapping, SW(Rn, Sn), is added onto the right half output of each round,
as shown in Fig. 1. It has been claimed that the n-round RDES is stronger than the
n-round DES against differential cryptanalysis.

Extended-DES (EDES) [10] is a cryptosystem utilising the iteration F-function of the DES
to extend the algorithm in the form of a matrix. It defines the input plaintext
as 96 bits and the key size as 128 bits, and the order of the S-boxes is randomly
arranged. The 128-bit key is divided into two independent keys, K1 and K2, and
the key scheduling algorithm of DES is used to generate the subkeys. The encryption
and decryption formulas of EDES are shown in Table 1. With this extended
configuration, it is verified to be less vulnerable to attack by differential cryptanalysis.

where SW = SW(Rn, Sn)

Fig. 1. The Randomised-DES (RDES)



Table 1. Encryption and Decryption Formulas of EDES 



Encryption                                    Decryption

A_i = B_{i-1}                                 A_{i-1} = C_i Xor f(A_i, K_{2,i})
B_i = C_{i-1} Xor f(B_{i-1}, K_{1,i})         B_{i-1} = A_i
C_i = A_{i-1} Xor f(B_{i-1}, K_{2,i})         C_{i-1} = B_i Xor f(A_i, K_{1,i})



2 Design Process by VHDL 

VHSIC Hardware Description Language (VHDL) provides a vendor-,
platform- and technology-independent medium for describing, simulating, and
documenting complex digital systems. It helps the rapid prototyping of
application-specific, simulatable and synthesisable VHDL models of various signal-processing
functions. Its support for multiple levels of abstraction, and for working at higher
levels of abstraction, facilitates the transfer of a design from the system-level algorithm
to a structural implementation. Throughout the modelling process, VHDL offers a
cost-effective means for rapid exploration of the area, speed, and power requirements of
the processor. It also facilitates functional trade-offs between algorithmic and architectural
design alternatives at the very early stages of the design process. The design process
in VHDL can be divided into three parts, as shown in Fig. 2: design
specification, executable specification and detailed design.



2.1 Design Specification 

Design specification captures customer requirements and converts these system-level 
needs into processing requirements (functional and performance) by VHDL 
description. Functional and performance analyses are performed to properly 
decompose the system level description. The system process has no notion of either 
hardware functionality or processor implementation. It also specifies an appropriate 
set of parameters specifying the performance and implementation goals for the 
processor (size, weight, power, cost, etc.). The traditional approach is to utilise text- 
based files in a specific format to support extraction of key parameters by the 
appropriate tools. Nowadays, VHDL is regarded as the unifying design representation
language and tool integration approach for describing the design specification.
Eventually, the design specification is translated into simulatable functions, which
are referred to as an executable specification.




Fig. 2. The Design process by VHDL 



2.2 Executable Specification 

An executable specification [12] is a behavioural description of a component or
system module that does not describe a specific implementation. The description reflects
the function and timing of the intended design as seen at the
component’s interface. During this process, the system-level processing
requirements are allocated to functional modules, and the specified functionality of
each module is then verified against the system requirements. Each module is then
integrated with the other components of the system to test whether an implementation
of the entire system is consistent with the behaviour specified in the design specification.
Finally, the result is a virtual prototype: a detailed behavioural description of the
processor hardware.

In this stage, an extensive simulation of all components is carried out using any of
the above models, which can be described functionally, behaviourally or
structurally. Simulation is carried out using the VHDL system simulator and
VHDL compiler. The intent is to verify all of the code during this portion of the
processor design. After this process, all modules are fully tested, resulting in a
detailed behavioural description of the processor hardware. Thus, the result of the
executable specification is the virtual prototype, which describes the custom modules down
to individual components at the behavioural level, with emphasis on interface
behaviour rather than internal chip structure.

2.3 Detailed Design

With the above processes, the design is modelled and verified through a set of
extensive functional and performance simulations using the integrated simulators of the
VHDL platform. At the completion of those simulations, the design is in the form of a
fully verified virtual prototype of the system, and the timing has also been verified to
ensure proper performance against the design specification. For the design to be realised
in physical hardware, in this stage the executable specification of the processor is
transformed into detailed designs at the Register Transfer Level (RTL) and/or logic level,
which specify the actual implementation technology. This process results in a
detailed technology-dependent hardware layout and artwork, netlist, and test vectors
for the entire processor. Using that information, the processor can be put into
real hardware for integration, as well as used for silicon fabrication. This is
accomplished by using the VHDL design compiler and the specific ASIC technology
library to generate the vendor-specific hardware configuration details.



3 The Crypto-Processor Model

The crypto-processors are synthesised using the Synopsys VHDL integrated simulator
and implemented in a Xilinx FPGA chip. The main task of the synthesis tool is to
transfer the design into a virtual prototype, with simulation and debugging of system
functionality. The implementation tool realises the design in real hardware and
is used for design verification. In this section, the detailed modelling of the baseline
algorithm, DES, is demonstrated. The reuse concept is also exercised in the RDES
and Extended-DES models. Finally, some observations and results are shown.



3.1 Top-Down Modelling of the DES 

To rapidly prototype the DES in VHDL, the procedures described in Section 2 are 
deployed. First of all, the design specification is defined, i.e. the mathematical 
representation of the algorithm. Then, the algorithm is partitioned into functional 
modules for synthesis, in which the algorithm is simulated in the form of functional, 
behavioural and structural models. The modules are refined into smaller components 
which are implementable in the FPGA architecture. Finally, the virtual prototype is 
transformed into the detailed design of an FPGA configuration netlist. 



Design Specification 

The design specification of the DES is the standardised algorithm defined in the 
International Standard document [6]. As stated in the document, the DES algorithm 
makes use of a series of permutation, substitution and exclusive-or operations to 
scramble the data depending on a binary key. The core of the algorithm computation 
includes the Initial Permutation (IP), the Expansion Box (E-box), the Substitution Box 
(S-box), the Permutation Box (P-box), the Inverse Initial Permutation (IP^-1) and the 
Exclusive-OR (XOR) operations. Combining the E-box, S-box and P-box with the 
associated XOR operations forms the iteration function (F-box), which is the core 
computation unit of the DES. In addition, the Key Schedule (KS) associated with the 
algorithm provides the 48-bit subkeys used in each round of iteration. The KS 
includes the Permutation Choice-1 and -2 (PC-1, PC-2), and a series of shift 
operations. 

According to the DES specification, all computations of the above units follow a set 
of operation tables defined in the standard [6]. To capture the design for 
implementation, each operation table specified in the standard is coded as a functional 
entity in a VHDL description. The result is a DES VHDL package, which translates the 
textual specification into synthesisable VHDL code. The package is 
then used for program coding, design validation and system integration. 
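
As an illustration of how an operation table becomes a reusable functional entity, the following sketch codes the Initial Permutation and the E-box as tables and applies them with one generic routine. It is written in Python purely for illustration (the package described here is in VHDL); the names `permute` and `invert_table` are assumptions of this sketch, and the tables are generated from their regular structure rather than taken from the authors' package.

```python
# Illustrative sketch only: table-driven DES permutations in Python.
# The package described in the text is written in VHDL; all names here
# are hypothetical.

# Initial Permutation (IP): entry i gives the 1-based input bit that
# becomes output bit i+1. The table is regular, so it is generated.
IP = [start - 8 * k for start in (58, 60, 62, 64, 57, 59, 61, 63) for k in range(8)]

# Expansion (E-box): 32 input bits -> 48 output bits, eight rows of six.
E = [((4 * row + col - 1) % 32) + 1 for row in range(8) for col in range(6)]


def permute(bits, table):
    """Return the bits selected by a 1-based position table."""
    return [bits[pos - 1] for pos in table]


def invert_table(table):
    """Build the inverse of a bijective table (e.g. IP^-1 from IP)."""
    inverse = [0] * len(table)
    for out_index, in_pos in enumerate(table):
        inverse[in_pos - 1] = out_index + 1
    return inverse


IP_INV = invert_table(IP)

block = [(i * 7 + 3) % 2 for i in range(64)]       # arbitrary 64-bit test pattern
round_trip = permute(permute(block, IP), IP_INV)    # IP followed by IP^-1
expanded = permute(list(range(1, 33)), E)           # which inputs the E-box selects
```

Because every table is applied by the same routine, adding another permutation only means adding another table, which mirrors the one-entity-per-table organisation described above.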



Executable Specification 

In this stage, the functionality of the algorithm is validated by simulating and testing 
the algorithm in the VHDL simulator, ultimately yielding a fully verified virtual 
prototype of the algorithm. To achieve this, the process is conducted through a 
combination of functionality partitioning and synthesis at all levels of abstraction. The 
partitioning of the model into smaller modules also facilitates the reuse concept. 

The DES algorithm is partitioned into four top-level functional modules: the 
F-box, the IP, the IP^-1 and the KS. For those functional modules, the interfaces between 
sub-modules, as well as the resource requirements (performance/area) of each 
component module, are specified. Where appropriate, a functional module is further 
decomposed into lower-level behavioural models, so as to form a layered architecture. 
This layered approach makes the modules more manageable, understandable, reusable, 
and maintainable. It also facilitates the design reuse concept and helps to build an 
encapsulated library. For instance, the functional module F-box with its defined 
interface is shown in Fig. 3, and the partitioned behavioural modules of the F-box are 
shown in Fig. 4, where each box represents a behavioural component model as a VHDL 
description entity. 

Fig. 3. Interface of F-box 

Fig. 4. Partitioned behavioural modules of F-box 

The behavioural VHDL description of each module is then designed and tested 
individually using the pre-defined VHDL DES-algorithm package as the component 
library. At this level, further model partitioning is also conducted for each 
sub-module, until sufficient detail is resolved for physical construction, performance and 
area requirements in the specific technology platform. 

In this case, the S-box is further partitioned into smaller sub-modules to 
satisfy the area and speed requirements. With this partitioning, the large function is 
implemented with minimum area and signal-path delay, and thus achieves 
improvements in both area and speed. Finally, the entire 
algorithm is integrated by a structural model which interconnects all verified modules 
together. 
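
The structural view of the F-box described above can be sketched as a composition of sub-modules with fixed interfaces (32-bit data in, 48-bit subkey in, 32-bit data out). The Python below is only an illustration of that layering, not the authors' VHDL. The E and P tables follow the standard, but `placeholder_sbox` is a hypothetical stand-in: the real substitution tables are defined in the standard and are not reproduced here.

```python
# Illustrative sketch only: the F-box as a structural composition of
# sub-modules with fixed interfaces. Function names are hypothetical.

E = [((4 * row + col - 1) % 32) + 1 for row in range(8) for col in range(6)]
P = [16, 7, 20, 21, 29, 12, 28, 17, 1, 15, 23, 26, 5, 18, 31, 10,
     2, 8, 24, 14, 32, 27, 3, 9, 19, 13, 30, 6, 22, 11, 4, 25]


def permute(bits, table):
    """Select bits according to a 1-based position table."""
    return [bits[pos - 1] for pos in table]


def xor_bits(a, b):
    """Bitwise XOR of two equal-length bit lists (XOR48 / XOR32)."""
    return [x ^ y for x, y in zip(a, b)]


def placeholder_sbox(six_bits):
    """HYPOTHETICAL stand-in for one S-box: 6 bits in, 4 bits out.

    It simply returns the middle four bits; it is NOT a DES S-box.
    """
    return six_bits[1:5]


def s_stage(bits48, sbox=placeholder_sbox):
    """Partitioned S-box: eight independent 6-to-4 sub-modules."""
    out = []
    for i in range(0, 48, 6):
        out.extend(sbox(bits48[i:i + 6]))
    return out


def f_box(r32, k48, sbox=placeholder_sbox):
    """Structural model: E-box -> XOR48 -> S-box stage -> P-box."""
    return permute(s_stage(xor_bits(permute(r32, E), k48), sbox), P)


r = [i % 2 for i in range(32)]
k = [0] * 48
out = f_box(r, k)
```

Each stage can be replaced or re-verified on its own as long as its bit widths stay the same, which is the property the layered partitioning relies on.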

In this stage, the functionality of all modules is verified and the timing analysis is 
computed. The result is a technology-independent, fully functional and verified 
virtual prototype of the algorithm. To make it synthesisable in real hardware, say an 
FPGA chip, it must be passed to the VHDL design compiler with the specific FPGA 
technology library to generate the detailed FPGA configuration netlist. 

Detailed Design 

Detailed design maps the design onto physical hardware. The virtual prototype 
synthesised in the previous stage is ready for realisation in an FPGA. In this stage, the 
entire design is converted into the FPGA netlist by the VHDL design compiler and 
the associated FPGA technology library. As a result, the Xilinx Netlist Format (xnf) 
file is generated. Using this netlist in the automated development 
environment, logic mapping, placement, and routing are done automatically and 
finally an FPGA configuration file is generated. The configuration file is then stored 
in an EPROM for programming the FPGA in the real hardware prototype. 

Finally, the DES algorithm is transformed into the FPGA configuration file. In 
pipeline mode, the algorithm occupies 2,176 Configurable 
Logic Blocks (CLBs) with a signal path delay of 164.96 ns. Detailed synthesis results of 
all modules are shown in Section 4. 



3.2 Design Re-use for the Randomised-DES and Extended-DES 



Randomised-DES [7,8,9] 

RDES is a case of design insertion into the existing DES algorithm. It is an extension 
of the DES obtained by inserting a special module, SWAP, into the algorithm as illustrated in 
Fig. 1. Thanks to the modular design concept applied in the DES design, the VHDL 
DES-algorithm package library, as well as the verified functional, behavioural and 
structural models, are reused as the components for constructing the RDES. 

In the virtual prototyping stage, only the functionality of the SWAP module needs to 
be verified against the specification. The insertion of the SWAP module only 
affects the internal structure of the F-box, which retains the same interface 
structure. Thus, all structural models retain their defined interfaces, as well as their 
original interconnections between modules as in the DES model. The modified 
structural model of the F-box is shown in Fig. 5. 



As a result, all of the structural models designed in the previous DES model are reused. 
For the entire RDES algorithm design, modification is made only in the structural 
model of the F-box. Through this reuse of the DES algorithm component library, the 
RDES algorithm is rapidly prototyped within one man-week. This is achieved by the 
top-down, model-year structural design methodology applied in designing the 
DES algorithm. 

Fig. 5. Modified Structural Model of F-box for RDES. 



Extended-DES [10] 

EDES is a case of modifying the existing DES algorithm. EDES is an extension 
of the DES that increases its data block length to 96 bits and the key size to 128 bits, 
together with a special arrangement in the order of the S-boxes. The processing 
block size remains 32 bits, using the same DES iteration F-box as its core and 
the same key-scheduling algorithm that is used in DES. In this case, only the top-level 
structural model needs to be re-designed and the S-box sub-module needs to be 
re-structured. In the S-box, the functionality of all the smaller components in the 
lower-level module is already verified in the DES-algorithm library, so it is only necessary 
to re-program the structural model according to the EDES design specification and then 
reinsert it into the encapsulated library. The F-box module and all other 
sub-modules within the F-box are not affected at all. 



Therefore, the modelling of the EDES in this case only requires modifying the 
structural-level models. All functional models of the processing units use the 
standard DES modules extracted from the encapsulated library built during the 
previous design. As a result, the virtual prototype of EDES re-defines the 
interconnection between modules in a structural model, and this re-design is 
prototyped in one man-week. 








4 Observations & Results 

4.1 Space and Speed Requirements 

To prototype a design in physical hardware, such as an FPGA chip, the space and 
speed constraints of the physical device are not negligible, since all electronic 
technologies deliver finite spatial resources for building functions and finite wiring 
resources for communications, which are especially tight in FPGAs. Using the 
top-down design concept to partition design functionality into small modules 
facilitates design optimisation against those constraints. 

During the modelling of the DES algorithm, the following results are observed. When 
the functional model is transformed directly to the detailed design, the resulting 
requirements in space and delay are higher. In the partitioned behavioural model, in which 
each module takes the form of a small component, the resulting requirements are much lower. The 
results for the DES modules are listed in Table 2. 



Table 2. Synthesis results of the DES modules¹ 

          Un-partitioned Functional Model    Partitioned Behavioural Model 
Module    CLB       Timing (ns)              CLB       Timing (ns) 
IP        0         0                        0         0 
IP^-1     0         0                        0         0 
E-box     N/A       N/A                      0         0 
P-box     N/A       N/A                      0         0 
S-box     230       33.92                    96        6.31 
XOR32     N/A       N/A                      16        3.26 
XOR48     N/A       N/A                      24        3.26 
F-box     323       42.38                    136       10.31 
KS        0         0                        0         0 
SWAP      N/A       N/A                      16        3.62 


From the table, it is found that the spatial requirement of the partitioned S-box is 
reduced by about 60% and the delay is reduced by over 80% (the CLB propagation delay is 
reduced from 7 stages to 2 stages). For the partitioned F-box, a reduction in area of about 
60% is achieved, and the CLB propagation delay is reduced from 9 stages to 4 stages, a 
delay reduction of above 75%. With this result, the algorithm is more feasible for 
implementation in an FPGA chip, with benefits in both spatial and timing requirements. 
Those benefits are also observed in the RDES and EDES implementations. 
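
The quoted reductions can be checked directly from the CLB counts and delays reported in Table 2. The short Python calculation below is only a worked check; the dictionary layout and the helper name `reduction` are choices made for this sketch.

```python
# Illustrative check of the reductions quoted from Table 2.
# Values are (CLB count, delay in ns) as reported in the table.
results = {
    "S-box": {"unpartitioned": (230, 33.92), "partitioned": (96, 6.31)},
    "F-box": {"unpartitioned": (323, 42.38), "partitioned": (136, 10.31)},
}


def reduction(before, after):
    """Percentage reduction from `before` to `after`."""
    return 100.0 * (before - after) / before


summary = {}
for name, r in results.items():
    clb_before, t_before = r["unpartitioned"]
    clb_after, t_after = r["partitioned"]
    summary[name] = (reduction(clb_before, clb_after),
                     reduction(t_before, t_after))

for name, (area, delay) in summary.items():
    print(f"{name}: area reduced by {area:.1f}%, delay reduced by {delay:.1f}%")
```

The area reductions come out at roughly 58% for both modules, and the delay reductions at roughly 81% (S-box) and 76% (F-box), consistent with the rounded figures in the text.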



¹ Timing is measured under the Xilinx Xfpga_4025e-3 library with parameters path_full, 
delay_max, max_paths, and WCCOM operating conditions. 



4.2 Design Insertion and Modification 

By designing the DES algorithm in a model-year architecture, defining the 
modules with open interfaces and partitioning functionality into small components, 
rapid design insertion and modification are facilitated. As in the cases of modelling 
the RDES and EDES, the turnaround time for prototyping those algorithms is reduced 
considerably. In modelling the RDES and EDES, above 70% and 40% of the development 
time is eliminated respectively. This is achieved by using the 
encapsulated library, as most of the functionality verification is exempted. Thus, the 
design reuse concept of the encapsulated component library has shown its advantage 
and significance in this respect. 



4.3 Design Automation 

Besides the design methodology, the platforms for simulation, debugging, synthesis, 
logic placement, routing, test vector generation and hardware implementation of the 
design are also important. None of those elements can be omitted in the process 
of processor design and prototyping. Therefore, a standardised integrated 
development environment is essential for the designer, so as to speed up the design process 
and reduce cumbersome design transfer/translation. In this study, the use of the Synopsys 
VHDL integrated platform has helped considerably with design automation, from 
design capture to synthesis in hardware. 



4.4 Hardware Prototype 

The designs are realised in a 25,000-logic-gate FPGA chip for testing and 
integration. The algorithms are synthesised in both recursive and pipeline modes of 
operation. The hardware prototype is shown in Fig. 6. 




Fig. 6. Prototype of the Hardware 



5. Conclusions 

Throughout the modelling of the crypto-processors in this study, the design concept 
of rapidly prototyping application-specific signal processors is practised. The 
described approach not only verifies design functionality early in the design 
process, but also provides the key to rapid prototyping and upgrading of signal 
processors, while at the same time reducing development time and costs significantly. 

Deployment of the model-year design concept in rapid prototyping enables the use of 
previous models as a baseline for further developments. In the cases of modelling 
the RDES and EDES algorithms, using DES as a baseline allowed the 
modification of the functional models in the virtual prototyping stages and allowed 
partitioning and re-targeting of the design during the synthesis activities. In this case, above 
50% of development time is saved in modelling the RDES and EDES algorithms. 

On the other hand, in the VHDL development environment, the design tools can automatically 
convert a VHDL description to a gate-level implementation in a given technology, 
and can automatically transform a synthesised design into a smaller or faster circuit 
through partitioning. In this experience, reductions of above 60% in area and 75% in timing 
are achieved. In addition, capturing the design in technology-independent VHDL 
functional models for the virtual prototype also enhances reuse of 
functional primitives and allows the design to be generated in different technologies. 
At the same time, it provides technology-independent documentation of a 
design and its functionality. 

To conclude, the modelling is carried out in a structured manner from design 
capture in VHDL code to design synthesis in an FPGA prototype. Through those 
prototyping procedures, the turnaround time of the design cycle is reduced; and 
through the modular design concept, design upgrade and 
modification become feasible. 



Abstract. The Sandia National Laboratories (SNL) Data Encryption Standard 
(DES) Application Specific Integrated Circuit (ASIC) is the fastest known 
implementation of the DES algorithm as defined in the Federal Information 
Processing Standards (FIPS) Publication 46-2. DES is used for protecting data 
by cryptographic means. The SNL DES ASIC, over 10 times faster than other 
currently available DES chips, is a high-speed, fully pipelined implementation 
offering encryption, decryption, unique key input, or algorithm bypassing on 
each clock cycle. Operating beyond 105 MHz on 64-bit words, this device is 
capable of data throughputs greater than 6.7 billion bits per second (tester 
limited). Simulations predict proper operation up to 9.28 billion bits per 
second. In low-frequency, low-data-rate applications, the ASIC consumes less 
than one milliwatt of power. The device has features for passing control signals 
synchronized to throughput data. Three SNL DES ASICs may be easily 
cascaded to provide the much greater security of triple-key, triple-DES. 



1 Introduction 

Since 1977, the United States has had a Data Encryption Standard (DES). DES is a 
block cipher that operates on 64-bit blocks of data and uses a 56-bit key [2]. It is a 
Feistel-type cipher. Feistel Ciphers [5][7] operate on left and right halves of a block 
of bits, in multiple rounds. The block halves are exchanged (left for right) from their 
usual order after the last round. An important property of Feistel Ciphers is that the 
function f, employed by a Feistel Cipher to operate on a left or right half-block of 
data, need not be invertible to allow inversion of the Feistel Cipher. In DES, the 
function f can itself be considered a product cipher (or substitution-permutation 
cipher), since that function performs both substitutions (to introduce confusion) and 
permutations (to introduce diffusion). 



Ç.K. Koç and C. Paar (Eds.): CHES'99, LNCS 1717, pp. 37-48, 1999. 

Another important property of Feistel Ciphers is that due to their structure, 
decryption is performed using the same multiple round process as encryption, but 
using the subkeys (one required per round) in reverse order. By eliminating the need 
for two different algorithms (one for encryption and one for decryption), hardware 
implementation is simplified and real estate on a chip is conserved. 
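
Both Feistel properties described above can be demonstrated with a small generic Feistel network. The Python sketch below is not DES: the round function and the subkeys are hypothetical, and the round function is deliberately non-invertible to show that inversion of the cipher does not depend on it.

```python
# Illustrative sketch only: a generic 16-round Feistel network.
# This is NOT DES; round_function and the subkeys are hypothetical.
MASK32 = 0xFFFFFFFF


def round_function(half, subkey):
    """Toy round function; squaring mod 2**32 makes it non-invertible."""
    x = (half ^ subkey) & MASK32
    return (x * x + 0x9E3779B9) & MASK32


def feistel(block64, subkeys):
    """Run all rounds, then exchange the halves after the last round."""
    left, right = (block64 >> 32) & MASK32, block64 & MASK32
    for k in subkeys:
        left, right = right, left ^ round_function(right, k)
    return (right << 32) | left   # final exchange of the two halves


def encrypt(block64, subkeys):
    return feistel(block64, subkeys)


def decrypt(block64, subkeys):
    # Same multi-round process, with the subkeys in reverse order.
    return feistel(block64, list(reversed(subkeys)))


subkeys = [(0x0F1E2D3C * (i + 1)) & MASK32 for i in range(16)]
plaintext = 0x0123456789ABCDEF
ciphertext = encrypt(plaintext, subkeys)
recovered = decrypt(ciphertext, subkeys)
```

Only `feistel` does any work; `encrypt` and `decrypt` differ solely in subkey order, which is why one datapath can serve both directions in hardware.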

A survey of the available integrated circuit implementations of the DES showed 
only devices with throughputs below 0.5 Gbps, far below the encryption rates 
required to scale Asynchronous Transfer Mode (ATM) encryption beyond 10 Gbps 
(SONET OC-192c). The existing implementations appeared to implement the sixteen 
rounds of the DES algorithm by iterating data through the hardware of a single round 
16 times, resulting in low throughputs and the inability to change key variables 
quickly. To achieve the high throughput and key agility required for high speed ATM 
cell encryption, a fully pipelined implementation of all 16 rounds of DES with the key 
variables pipelined along with the data stages was designed and studied. 



2 Proof of Principle 

In order to study pipelined (and non-pipelined) implementations of the DES, an Excel 
spreadsheet implementation of the DES key schedule and algorithm was developed. 
This spreadsheet implementation of DES enabled the designers to familiarize 
themselves with the algorithm, and to examine multiple options for hardware 
implementation of the key schedule. After verification of the proper operation of the 
spreadsheet, the well-tested descriptions of the permutations and "S-boxes" were "cut 
and pasted" into the hardware description language, minimizing the opportunity for 
transcription errors. 

First, the permutations and S-boxes of a single round were implemented in 
ALTERA's AHDL, then compiled and simulated for ALTERA's 7000, 8000, and 10K 
families of devices. The simulations indicated that these operations could be 
computed more swiftly in the 7000 series devices, but only a small portion of the 
required functionality could be fit into the smaller gate count devices. We found that 
only four of the sixteen rounds of the DES algorithm would fit into an ALTERA 
10K100 device, necessitating four large Programmable Logic Devices (PLDs) to fully 
pipeline all sixteen rounds. The key was pipelined through each stage along with the 
data in order to provide full key agility (the ability to change keys on each and every 
clocked word transfer, if desired). This pipelining of key as well as data also 
greatly increased the number of I/O pins required to transfer the data between the 
devices comprising the pipeline. 

The simulations of the circuitry in ALTERA 10K100-3 speed grade devices 
showed that the time required to compute and latch a single round was 50 ns. Once 
the synchronous pipeline was filled, 64 bits of output every 50 ns would yield 
approximately 1.3 Gbps throughput with a latency of 800 ns. 

For certain "feedback" modes of operation, such as Cipher Block Chaining (CBC) 
[3], the output of the pipeline must be combined with the next input. This requires the 
pipeline to "run dry", with only one 64-bit data word traversing all the pipeline stages 
before the next word can be input to the pipeline. Therefore, the CBC mode 
throughput for the synchronous pipeline is 64 bits each 800 ns, or 0.08 Gbps (one 
sixteenth of the full pipeline throughput). For this reason, an "asynchronous" version 
of the pipeline (with no latches between stages) was analyzed, in order to maximize 
the CBC mode throughput by minimizing the pipeline latency. Analysis of this 
"asynchronous" pipeline showed that the total latency could be reduced to 650 ns, 
improving the potential CBC mode throughput to 64 bits per 650 ns, or 0.098 Gbps. 
Clearly, CBC and other similar feedback modes of operation are difficult to scale to 
high speed operation. 
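
The throughput figures quoted above follow from the stage time and the number of rounds. The Python calculation below is a worked check only; the constant and function names are choices made for this sketch.

```python
# Illustrative arithmetic for the PLD pipeline figures quoted above.
BLOCK_BITS = 64
ROUNDS = 16
STAGE_NS = 50.0            # time to compute and latch one round
ASYNC_LATENCY_NS = 650.0   # total latency of the unlatched pipeline


def gbps(bits, nanoseconds):
    """Bits per nanosecond is numerically equal to Gbps."""
    return bits / nanoseconds


sync_latency_ns = ROUNDS * STAGE_NS              # 16 stages of 50 ns each
full_pipeline = gbps(BLOCK_BITS, STAGE_NS)       # one block leaves per stage time
cbc_sync = gbps(BLOCK_BITS, sync_latency_ns)     # one block per full traversal
cbc_async = gbps(BLOCK_BITS, ASYNC_LATENCY_NS)

print(f"full pipeline:      {full_pipeline:.2f} Gbps")
print(f"CBC (synchronous):  {cbc_sync:.3f} Gbps")
print(f"CBC (asynchronous): {cbc_async:.3f} Gbps")
```

The ratio between the full-pipeline and synchronous CBC rates equals the number of stages, which is why feedback modes gain nothing from deeper pipelining.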

For non-feedback modes of operation such as Electronic Codebook (ECB) [3] or 
Counter Mode [1], the full pipeline can be utilized. In order to achieve 10 Gbps 
throughput (SONET OC-192c), however, eight such 1.3 Gbps pipelines would have to 
be operated in parallel. ATM cell order must be maintained, and cells of different 
Virtual Circuits (requiring different key variables) may be interleaved in the cell 
stream. The processing of more than one ATM cell in parallel therefore introduces 
great complexity into an encryptor. Since an ATM cell payload is 384 bits, evenly 
divisible into six 64-bit words, six is the practical limit for parallel operation of 64-bit 
encryption pipelines for ATM cell encryption. In order to achieve 10 Gbps 
encryption, the throughput of each of the six pipelines must be at least 1.7 Gbps, 
which is greater than the 1.3 Gbps predicted by the ALTERA 10K100-3 simulations. 
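
The parallelism argument above reduces to a few lines of arithmetic, sketched here in Python. The variable names are illustrative; the numbers are those given in the text.

```python
# Illustrative arithmetic for the parallel-pipeline argument above.
import math

TARGET_GBPS = 10.0           # SONET OC-192c target used in the text
PIPELINE_GBPS = 64 / 50.0    # 64 bits every 50 ns, i.e. about 1.3 Gbps
ATM_PAYLOAD_BITS = 384
BLOCK_BITS = 64

# Pipelines needed if each one runs at the simulated PLD rate.
pipelines_needed = math.ceil(TARGET_GBPS / PIPELINE_GBPS)

# One ATM cell payload splits into this many 64-bit words, which bounds
# how many pipelines can work on a single cell at once.
max_parallel = ATM_PAYLOAD_BITS // BLOCK_BITS

# Rate each pipeline must sustain when only `max_parallel` are used.
required_per_pipeline = TARGET_GBPS / max_parallel

print(pipelines_needed, max_parallel, round(required_per_pipeline, 2))
```

Since the required per-pipeline rate exceeds what the PLD simulations predict, the PLD approach cannot reach the target within the six-pipeline limit.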

The pipelined DES design was then implemented as a CMOS Application Specific 
Integrated Circuit in order to achieve the increased throughput required to implement 
ATM cell encryption at rates greater than 10 Gbps. 



3 SNL DES ASIC 

The Sandia National Laboratories (SNL) DES Application Specific Integrated Circuit 
(ASIC) is the fastest known implementation of the DES algorithm as defined in FIPS 
Pub 46-2 [2]. The SNL DES ASIC, over 10 times faster than other currently available 
DES chips, is a high-speed, fully pipelined implementation providing encryption, 
decryption, unique key input, or algorithm bypassing on each clock cycle. In other 
words, for each clock cycle, data presented to the ASIC may be encrypted or 
decrypted using the key data presented to the ASIC at that cycle or the data may pass 
through the ASIC with no modification. Operating beyond 105 MHz on 64-bit words, 
this device is capable of data throughputs greater than 6.7 Gbps, while simulations 
show the chip capable of operating at up to 9.28 Gbps. In low frequency applications 
the device consumes less than one milliwatt of power. The device also has features 
for passing control signals synchronized to the data. 

The SNL DES ASIC was fabricated with static 0.6 micron CMOS technology. Its 
die is 11.1 millimeters square, and the device has 319 pins in total (251 signals and 68 
power/ground pins). All outputs are tristate CMOS drivers to facilitate common 
busses driven by several devices. This device accommodates the full input of plain 
text, 64 bits, and a complete DES key of 56 bits. Additionally, 120 synchronous 
output signals provide 64 bits of cipher text and the 56-bit key. 

Fig. 1. DES ASIC Block Diagram 

Three input only signals control electrical functions for logic clocking (CLK), logic 
reset (RST), and the tristate output enables (OE). The CLK signal provides 
synchronous operation and pipeline latching on the rising edge. Both RST and OE 
are asynchronous, active high signals. 

Two synchronous signals, decrypt/encrypt (DEN) and bypass (BYP), determine the 
DES cryptographic functionality. On the rising edge of each CLK, the logic value 
presented to the DEN input selects whether input data will be decrypted (logic 1) or 
encrypted (logic 0). In a similar manner, BYP selects algorithm bypassing (logic 1) 
or not (logic 0) for each clock cycle. Both of these signals pipeline through the ASIC 
and exit the device synchronous with the key and data. 

Two more signals, start-of-cell (SOC) and data valid (VAL) enter and exit the 
device synchronous with data and key information. These are merely data bits that 
may provide any user-defined information to travel with input text and key. These 
signals are typically used to indicate the start of an ATM cell and which words in the 
pipeline contain valid data. 

ASICs from two wafer lots were shown to operate beyond the maximum frequency 
(105 MHz) of Sandia’s IC Test systems. For 64-bit words, this equates to 6.7 Gb/s. 
This operational frequency was tested over a voltage range of 4.5 to 5.5 Volts and a 
temperature range of -55 to 125 degrees C. 



3.1 Design 

After implementing the DES algorithm in the set of four PLDs, the design was 
translated into VHDL and synthesized into the Compass library of standard cells. The 
device (figure 2) was fabricated in Sandia's MDL (Microelectronics Development 
Laboratory). Two wafer lots were successfully fabricated. 

This implementation is a fully pipelined design. It takes eighteen clock cycles to 
completely process data through the pipeline causing the appropriately decrypted, 
encrypted, or bypassed data to appear on the ASIC outputs. Additionally, all key and 
control input signals pass through the pipeline and exit the ASIC synchronized to the 
ciphertext outputs. 
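
The behaviour described above, where key and control bits travel through the pipeline alongside each data word, can be modelled with a simple shift register. The Python sketch below is a behavioural illustration only: the class and field names are assumptions, and `toy_transform` is a hypothetical stand-in (XOR with the key) for the sixteen DES rounds.

```python
# Illustrative sketch only: behavioural model of a fully pipelined device
# in which key and control bits accompany each data word.
# The per-word transform is a TOY stand-in, not DES.
from collections import deque
from dataclasses import dataclass

LATENCY = 18  # clock cycles from input to output, as stated in the text


@dataclass
class Word:
    data: int
    key: int
    den: int = 0   # 1 = decrypt, 0 = encrypt
    byp: int = 0   # 1 = bypass the algorithm
    soc: int = 0   # user-defined bit, e.g. start of cell
    val: int = 0   # user-defined bit, e.g. data valid


def toy_transform(word):
    """Hypothetical placeholder for the DES rounds."""
    if word.byp:
        return word.data
    return word.data ^ word.key   # XOR is self-inverse, so DEN is unused here


class Pipeline:
    def __init__(self):
        self.stages = deque([None] * LATENCY, maxlen=LATENCY)

    def clock(self, word):
        """Advance one clock: accept a new word, return the one leaving."""
        leaving = self.stages[-1]
        self.stages.appendleft(word)   # the oldest entry drops off the end
        if leaving is None:
            return None
        return Word(toy_transform(leaving), leaving.key, leaving.den,
                    leaving.byp, leaving.soc, leaving.val)


pipe = Pipeline()
inputs = [Word(0x1111, key=0xA, soc=1, val=1),
          Word(0x2222, key=0xB, den=1, val=1),
          Word(0x3333, key=0xC, byp=1, val=1)]
outputs = [pipe.clock(w) for w in inputs + [None] * LATENCY]
```

Each output word still carries the key and control bits that entered with it, so downstream logic can tell which key and mode produced every result without separate bookkeeping.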

The SNL DES ASIC is the only known fully pipelined implementation of all 16 
rounds of the DES algorithm. Pipelining increased the device throughput by dividing 
the algorithm into equally sized blocks and latching information at the block 
boundaries. This gives signals just enough time to process through each block 
between clock cycles, thereby maximizing the operational frequency. 




Fig. 2. DES ASIC Die (11.1 x 11.1 mm) 

Pipelining the algorithm allows a high degree of key and function agility for this 
device. Here, agility means that the SNL DES ASIC processes inputs differently on 
each clock cycle. As an example, the device may encrypt data with one key on one 
clock cycle, decrypt new input data with a different key on the very next clock cycle, 
bypass the algorithm (pass the data unencrypted) on the following clock, then encrypt 
data with yet another independent key on the fourth clock cycle. The control signals 
used to select these various modes of operation are presented at the output, passing 
through the device synchronized to the input data and the input key information. All 
inputs and outputs (control, key, and data) enter and exit the part synchronously. The 
authors know of no other single-chip implementation with all these features. 

Features. This DES ASIC is unique in its ability to encrypt or decrypt with a new 
key, or pass information unprocessed, on each and every clock cycle. This enables 
the separate encryption of many virtual channels within a high speed communication 
system. In addition, it enables cryptosystems to be built with fewer components and 
lower cost. Also, as stated previously, no other encryption chip outputs the key 
corresponding to each data word in every clock cycle. 

This per-cycle input and output of all variables facilitates cascading the devices for 
increased encryption strength, and paralleling the devices for even higher throughput. 
Another unique feature of this device is its ability to pass two user-defined control 
bits in synchronism with the data being encrypted, decrypted, or bypassed. This 
capability is indispensable for the design of ATM data encryptors, which must 
identify the Start of Cell boundaries, and for systems that must flag data as “valid” or 
“not valid” in the encryption/decryption pipeline. 

Design Enhancements. Since the initial ASICs were fabricated, several 
enhancements have been identified. These enhancements would increase throughput, 
aid in cascading devices, and ease the use at the board level. Projected enhancements 
include use of improved design tools, improved synthesis options, low-voltage high- 
speed I/O buffers, improved pin-outs, greater parallelism, and processing in higher- 
speed technologies. 

Several design techniques could improve the design of the existing SNL DES 
ASIC. For example, recent synthesis developments would allow the DES ASIC to be 
redesigned with additional pipeline stages. A greater number of pipeline stages with 
improved timing would increase the operational frequency and boost the throughput 
beyond 10 Gb/s. 

To enhance high-frequency operation at the circuit-board level, higher 
performance input-output buffers would reduce switching noise. The present design 
uses CMOS level (0 - 5 V) interfaces. Future designs would incorporate low 
voltage, low power I/O buffers. Bringing out the clock phased with the output data 
would facilitate higher speeds and greater performance by enabling source 
synchronous clocking. Also, optionally inverting the encrypt/decrypt output would 
better facilitate encrypt-decrypt-encrypt triple DES (described below). 

A redesign into Gallium Arsenide (GaAs) technology should yield a factor of 3 to 
4 improvement in speed of the ASIC. This would produce expected throughputs of 
30-40 Gbps. 

To achieve higher total throughput, multiple SNL DES ASICs can operate in 
parallel, with each ASIC processing a 64-bit block of the data stream. Figure 3 
contains an example of multiple devices performing DES operations on two blocks of 
data in parallel. 

Because the data outputs of the SNL DES ASIC are tri-stated, there are several 
ways the ASICs can be used in parallel. The data outputs from both ASICs can be 
connected to a single output bus in a time-multiplexed fashion and the two DES 
ASICs can be operated using opposite clock phases to double the data throughput to 
greater than 13 Gbps. If both ASICs are driven off the same clock edge, the two 
64-bit wide data outputs can also be combined into a single 128-bit wide output to 
achieve the 13 Gbps throughput. 

Fig. 3. Parallel DES ASIC Implementation 

For encryption of Asynchronous Transfer Mode (ATM) communication sessions, 
where six SNL DES ASICs could operate in parallel on 64-bit blocks to encrypt a 
384-bit payload, 40 Gbps (OC-768) rates could be achieved. The authors would expect six 
parallel DES ASICs made using a GaAs process to support encryption at 160 Gbps 
and beyond. 
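
The single-device and parallel throughput figures follow from one 64-bit word per device per clock cycle. The Python below is a worked check under that assumption; the function name `throughput_gbps` is a choice made for this sketch.

```python
# Illustrative arithmetic for the single and parallel ASIC throughputs.
CLOCK_HZ = 105e6   # highest frequency the tester could apply
WORD_BITS = 64


def throughput_gbps(clock_hz, n_devices=1):
    """Assumes one 64-bit word per device per clock cycle."""
    return clock_hz * WORD_BITS * n_devices / 1e9


single = throughput_gbps(CLOCK_HZ)       # one ASIC
dual = throughput_gbps(CLOCK_HZ, 2)      # two ASICs, opposite phases or 128-bit bus
six = throughput_gbps(CLOCK_HZ, 6)       # six ASICs on one 384-bit ATM payload

# Clock frequency implied by the simulated 9.28 Gbps figure.
simulated_clock_mhz = 9.28e9 / WORD_BITS / 1e6

print(f"{single:.2f} {dual:.2f} {six:.2f} Gbps; {simulated_clock_mhz:.0f} MHz")
```

At the tested 105 MHz, six devices already reach roughly 40 Gbps, which is the basis for the OC-768 figure.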

Power Consumption. Because the device is fully static CMOS, its power usage is 
proportional to operating frequency. At 105 MHz, the SNL DES ASIC consumes 6.5 
Watts of power. While designed to dissipate the heat generated in high-bandwidth 
applications, the SNL DES ASIC can be operated at much lower data clock rates, 
consuming very little power, thus enabling many low speed, extremely low power 
applications. 



Table 1. Power Consumption of SNL DES ASIC 

In the 1200 to 640,000 bits per second range, the DES ASIC consumes only 
microwatts of power. The SNL DES ASIC operating at 10 kHz (640,000 bits per 
second, or ten 64 Kbps voice channels) only consumes 510 microwatts. Iterating 
around the ASIC to triple encrypt a single 64 Kbps voice channel, the SNL DES 
ASIC would need to operate at about 3 kHz, requiring well under half of a 
milliwatt of power. Triple encrypting lower data rate channels (1200-28800 bps) 
requires even less power, enabling operation from a small battery or solar panel. 
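
The low-rate figures can be reproduced with a short calculation. The Python below derives the clock needed for triple encryption of one voice channel and estimates the power by scaling the quoted 10 kHz operating point. The linear scaling with no fixed offset is an assumption of this sketch, based only on the statement that power is proportional to frequency, and the resulting estimate is not a measured value.

```python
# Illustrative arithmetic for the low-rate operating points quoted above.
WORD_BITS = 64
REF_CLOCK_HZ = 10_000     # 10 kHz operating point quoted in the text
REF_POWER_UW = 510.0      # microwatts at 10 kHz, from the text


def clock_for(bit_rate_bps, passes=1):
    """Clock needed to push `bit_rate_bps` through the device `passes` times."""
    return passes * bit_rate_bps / WORD_BITS


def estimated_power_uw(clock_hz):
    """ASSUMPTION: power scales linearly with clock, with no fixed offset."""
    return REF_POWER_UW * clock_hz / REF_CLOCK_HZ


ten_channels_clock = clock_for(640_000)                 # ten 64 Kbps channels
triple_voice_clock = clock_for(64_000, passes=3)        # one channel, three passes
triple_voice_power = estimated_power_uw(triple_voice_clock)

print(ten_channels_clock, triple_voice_clock, triple_voice_power)
```

Under that assumption the estimate is about 150 microwatts, in line with the statement that triple encryption of one voice channel needs well under half a milliwatt.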
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3.2 Package Development 

The SNL DES ASIC has been packaged into three different packages: a 360-pin PGA, 
a 503-pin PGA, and a 352-pin BGA. The original 360-pin package was used 
in initial testing of the DES ASIC performance. It was in this package that the DES 
chip was shown to operate at over 105 MHz. Sandia had earlier developed a 1.1 
million gate PLD board that used 11 Altera 10K100 devices. This board was used in 
the development of the DES ASIC pipeline design, housing the four 10K100 devices. 
It was determined that the SNL DES ASIC could be used with the original PLD11 
board, being substituted for a single 10K100 device, if a 503-pin equivalent package 
were available. Sandia designed an FR4 board onto which the DES ASIC was wire 
bonded and 503 pins could be inserted. The chip-on-board package had to be 
designed to dissipate up to 5 watts produced by the DES ASIC. This is accomplished 
by attaching the DES die directly onto a gold-plated copper insert that is attached to 
the FR4 board using a tin-lead solder preform. Pictures of a representative cross 
section and this package are shown in figures 4 and 5. 



[Figure 4 (diagram not reproduced): labeled parts are the copper heat sink and the DES die.] 

Fig. 4. Cross Section of the 503-Pin Package 

The design of this package enables a heat sink and integrated fan to be attached to 
the back of the copper insert to enable the package to dissipate over 6 watts. The FR4 
printed wiring board uses 3 mil copper traces and spaces with 5 mil vias. This design 
also allowed the board to be used to connect the existing bus signal assignments from 
the PLD11 board to the appropriate key and text signals on the SNL DES ASIC. Two 
versions of the package were designed and fabricated. Each has a different wiring 
schematic designed to fit into a different socket on the PLD11 board. SNL DES 
ASICs in the 503-pin package were demonstrated at the Supercomputing 98 
Conference in Orlando, November 1998. 

Fig. 5. Picture of the 503-Pin FR4 Board Package 

The SNL DES ASIC has also been packaged in a 35 x 35 mm, 352-pin ball grid 
array (BGA) package. This is an open tool commercial package available from 
Abpac Inc. (of Phoenix, Arizona, U.S.A.). The package was chosen not only for its 
smaller size and its capability to dissipate over 5 watts, but also for its low cost. 
Abpac's automated manufacturing capability reduced packaging costs by a factor of 
more than 20. This package is being used in the design of a triple key, triple DES 
encryption module. 



3.3 Applications 

Although mainly used as an encryption engine for single DES or triple DES 
cryptosystems, the SNL DES ASIC has other uses such as a data randomizer. Some 
encryption algorithms need to hide or obscure relationships between bits or bytes of 
data prior to encryption. Using the SNL DES ASIC on the front end as a randomizer 
introduces no significant delay to the host cryptosystem. In a similar vein, this device 
can be used as a pseudo random number generator as part of a larger cryptosystem. 

In a counter mode or filter generator cryptosystem (shown in figure 6), a linear 
recurring sequence (LRS) generator produces a sequence, which is fed to a non-linear 
function. The purpose of the non-linear function (in this case, an SNL DES ASIC) is 
to mask the linearity properties of the LRS. The output of the DES ASIC is then 
combined with the data, through an Exclusive-OR operation. 
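
A minimal sketch of the filter generator arrangement follows. It is an illustration under stated assumptions, not the authors' design: the LFSR taps are arbitrary, and a truncated SHA-256 hash stands in for the keyed DES function because Python's standard library has no DES. All function names are hypothetical.

```python
import hashlib

MASK64 = (1 << 64) - 1


def lfsr_step(state):
    """Advance a 64-bit linear recurring sequence by one bit (illustrative taps)."""
    bit = ((state >> 63) ^ (state >> 62) ^ (state >> 60) ^ (state >> 59)) & 1
    return ((state << 1) | bit) & MASK64


def nonlinear(block, key):
    """Stand-in for the keyed DES block function (NOT DES): masks the linearity."""
    digest = hashlib.sha256(key + block.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big")


def keystream_xor(data, key, seed):
    """XOR data with the filtered LRS output; the same call encrypts and decrypts."""
    state = seed & MASK64  # seed should be nonzero, or the LRS stays at zero
    out = bytearray()
    for offset in range(0, len(data), 8):
        for _ in range(64):                 # clock out a fresh 64-bit LRS word
            state = lfsr_step(state)
        pad = nonlinear(state, key).to_bytes(8, "big")
        chunk = data[offset:offset + 8]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)


message = b"counter mode sketch"
ciphertext = keystream_xor(message, b"demo key", seed=1)
assert keystream_xor(ciphertext, b"demo key", seed=1) == message
```

Because the data only ever meets the keystream through the Exclusive-OR, the receiver runs the identical generator and function to recover the plaintext.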
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Fig. 6. Encryption/Decryption for Counter Mode 

This device can be used to keep encryption/decryption keys synchronized with the 
data in cascaded triple DES implementations (described below). At the highest data 
rates, programmable logic devices (PLDs) may not be able to keep pace with the SNL 
DES ASIC for passing keys with the data. Consequently, SNL DES ASICs operating 
in bypass mode may be used to pass keys to subsequent encryption chips in step with 
the data. 

Triple DES. Triple DES employs the Data Encryption Standard algorithm in a way 
sometimes referred to as encrypt-decrypt-encrypt (E-D-E) mode. E-D-E mode using 
two keys was proposed by W. Tuchman and summarized by Schneier in [7]. The 
incoming plaintext is encrypted with the first key, decrypted with the second key, and 
then encrypted again with the first key. On the other end, the received ciphertext is 
decrypted with the first key, encrypted with the second key, and again decrypted with 
the first key to produce plaintext. If the two keys are set alike, it has the effect of 
single encryption with one key, thereby preserving backward compatibility. While 
advances have been made in cracking single DES cryptosystems [4], data protected 
by the SNL DES ASICs using two key, triple DES have a good degree of 
cryptographic robustness. 
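
The two key E-D-E composition and its backward compatibility can be sketched as follows. This is an illustration only: a small hash-based Feistel cipher (`toy_encrypt`, `toy_decrypt`) stands in for DES, and all names are hypothetical. Only the composition of the three operations reflects the text.

```python
import hashlib


def _round(half, key, i):
    """Round function of the toy Feistel cipher (a stand-in, not DES)."""
    d = hashlib.sha256(key + bytes([i]) + half.to_bytes(4, "big")).digest()
    return int.from_bytes(d[:4], "big")


def toy_encrypt(block, key, rounds=4):
    left, right = block >> 32, block & 0xFFFFFFFF
    for i in range(rounds):
        left, right = right, left ^ _round(right, key, i)
    return (left << 32) | right


def toy_decrypt(block, key, rounds=4):
    left, right = block >> 32, block & 0xFFFFFFFF
    for i in reversed(range(rounds)):
        left, right = right ^ _round(left, key, i), left
    return (left << 32) | right


def ede_encrypt(block, k1, k2):
    """Encrypt with k1, decrypt with k2, encrypt again with k1."""
    return toy_encrypt(toy_decrypt(toy_encrypt(block, k1), k2), k1)


def ede_decrypt(block, k1, k2):
    """Decrypt with k1, encrypt with k2, decrypt again with k1."""
    return toy_decrypt(toy_encrypt(toy_decrypt(block, k1), k2), k1)


plaintext = 0x0123456789ABCDEF
assert ede_decrypt(ede_encrypt(plaintext, b"A", b"B"), b"A", b"B") == plaintext
# With both keys alike, the middle decryption undoes the first encryption,
# leaving a single encryption: this is the backward compatibility property.
assert ede_encrypt(plaintext, b"A", b"A") == toy_encrypt(plaintext, b"A")
```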

Two key, triple DES schemes (with 56 bit keys) can be cryptanalyzed using a 
chosen plaintext attack with about $2^{56}$ operations and $2^{56}$ words of memory [6]. (In 
terms of work, this is on par with two key, double DES, which is susceptible to a 
known plaintext attack with $2^{56}$ operations and $2^{56}$ words of memory [6].) Although in 
theory this is a weakness, Merkle and Hellman [6] state that in practice it is very 
difficult to mount a chosen plaintext attack against a DES cryptosystem. This makes 
two key, triple DES significantly stronger than two key, double DES, because an 
attack would now require $2^{112}$ operations (and no memory). 
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Triple DES can also be performed with three independent keys, using a separate 
key for each of the encryption and decryption operations. Triple DES with three 
independent keys gives a slightly higher level of protection, but is susceptible to a 
meet-in-the-middle attack requiring about $2^{112}$ operations and $2^{56}$ words of memory 
[7]. Again, with regard to compatibility, keys one and three could be set to the same 
value, to interoperate with Tuchman’s two key, triple DES, or all three keys could be 
set alike to interoperate with single DES. 

The SNL DES ASIC supports triple DES in several unique ways. For highest 
throughput speeds, multiple SNL DES ASICs can be cascaded to implement the 
encrypt-decrypt-encrypt mode. In situations where top performance is not needed, 
but board real estate, cost, or power consumption are constrained, a single SNL DES 
ASIC may perform triple DES in an iterative manner. 

Because keys and control information march in lock step with the data, a string of 
three SNL DES ASICs can be cascaded. This would be accomplished by connecting 
the output data, key out, and control information output pins of one ASIC to the input 
data, key in, and control information input pins on the next ASIC. To perform E-D-E 
triple DES, an inverter must be placed on the path of the encrypt/decrypt signal 
between ASICs. This way, the middle ASIC will always perform the opposite 
operation (encrypt or decrypt) from the first and last ASICs in the string. PLDs, or 
SNL DES ASICs set to bypass mode, can be used to provide the proper (18 clock 
tick) delay, so that the keys for the second and third encryption/decryption operations 
will arrive in synchronization with the appropriate data. An example of this, using 
two keys is shown in figure 7. 




Fig. 7. Cascaded, Multiple ASIC, Two Key, Triple DES Implementation 

By applying appropriate glue logic, the SNL DES ASIC can be used to perform 
E-D-E triple DES in an iterative manner by looping the data, key, and control 
information around the ASIC, processing the data three times. The glue logic will 
need to contain a two bit wide, 18 stage delay to count (in synchronization with the 
data, key, and control information) the number of times a given block of data has been 
processed. Logic will also be needed to invert the encrypt/decrypt bit between passes 
through the SNL DES ASIC. An example of this is shown in figure 8. 
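
The iterative scheme can be modelled as an 18-stage loop in which each block carries a pass count and an encrypt/decrypt flag. The sketch below is a behavioural illustration only, with no cryptography and no key handling; the scheduling policy (a looping block takes priority over a new block) and all names are assumptions.

```python
from collections import deque

PIPELINE_DEPTH = 18  # clock ticks through the ASIC, as stated in the text
PASSES = 3           # encrypt, decrypt, encrypt


def simulate(num_blocks):
    """Return (clock ticks used, list of (block id, operations applied))."""
    pipeline = deque([None] * PIPELINE_DEPTH)   # one slot per pipeline stage
    pending = deque(range(num_blocks))
    finished = []
    ticks = 0
    while len(finished) < num_blocks:
        ticks += 1
        leaving = pipeline.pop()                # slot leaving the last stage
        entering = None
        if leaving is not None:
            block_id, passes, ops = leaving
            if passes == PASSES:
                finished.append((block_id, ops))
            else:
                # Loop back: bump the pass counter, invert the E/D bit.
                next_op = "D" if ops[-1] == "E" else "E"
                entering = (block_id, passes + 1, ops + [next_op])
        if entering is None and pending:
            entering = (pending.popleft(), 1, ["E"])   # admit a new block
        pipeline.appendleft(entering)
    return ticks, finished


ticks, done = simulate(36)
print(ticks, done[0])
```

In this model a full loop of 18 blocks needs 54 ticks for its three passes, so sustained throughput is about one third of a block per clock tick, compared with one block per tick for the cascaded arrangement.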
Fig. 8. Iterative, Single ASIC, Triple DES Implementation 



4 Summary 

For this project, the authors explored how a representative “heavyweight,” 
unclassified encryption algorithm could be optimized and pipelined. This project has 
yielded a device that could be used in building encryption research prototypes. The 
project was successful, producing not only a research vehicle, but the fastest known 
ASIC implementation of DES. 

The SNL DES ASIC can support two- or three-key triple DES using a multiple 
cascaded ASIC configuration at rates of 6.7 Gbps and beyond. It can also support 
very low power triple DES, iteratively, in a single ASIC configuration. 
Abstract. CRYPTON is a 128-bit block encryption algorithm proposed as a 
candidate for the Advanced Encryption Standard (AES), and is expected to be 
especially efficient in hardware implementation. In this paper, hardware designs 
of CRYPTON and their performance estimation results are presented. 
Straightforward hardware designs are improved by exploiting hardware-friendly 
features of CRYPTON. Hardware architectures are described in VHDL and 
simulated. Circuits are synthesized using a 0.35 µm gate array library, and timing 
and gate counts are measured. A data encryption rate of 1.6 Gbit/s could be 
achieved with a moderate area of 30,000 gates, and up to 2.6 Gbit/s with less than 
100,000 gates. 



1. Introduction 

The explosive growth in computer systems and their interconnections via networks 
has changed the way we live. From politics to business, our lives depend on the 
information stored and communicated using these systems. This in turn has led to a 
heightened awareness of the need for information security. To enforce information 
security by protecting data and resources from disclosure, a secure and efficient 
encryption algorithm is needed. Since the widely used encryption algorithms, DES and 
Triple DES, are no longer considered secure or efficient enough for future applications, 
a new block encryption algorithm with a strength equal to or better than that of Triple 
DES and significantly improved efficiency is needed. 

Recently, very high bandwidth networking technologies such as ATM and Gigabit 
Ethernet have been rapidly deployed [1]. Network applications such as virtual private 
networks [2] need high-speed execution of encryption algorithms to match high-speed 
networks. To utilize high-speed networks at link speed, an encryption speed 
as fast as 1 Gbit/s is required. According to our experiment, the 
execution speed of Triple DES on a 333 MHz Intel Pentium II is only 19.6 Mbps, and 
the highest performance of Triple DES from commercially available encryption 
hardware is about 200 Mbps [3, 4]. 

CRYPTON [5, 6] is a 128-bit block encryption algorithm proposed as a candidate 
for the Advanced Encryption Standard (AES) [7]. In the evaluation performed by 
NIST, its software implementation on a 200 MHz Pentium Pro showed about 40 Mbps, 
the best encryption and decryption speeds among the AES candidates [8]. Hardware 
implementations of CRYPTON are expected to be more efficient than software 
implementations because it was designed from the beginning with hardware 
implementations in mind. Encryption and decryption use identical circuitry, 
and no large logic is needed for the S-boxes. Moreover, it does not use addition or 
multiplication but only exclusive-OR operations, and the exclusive-OR 
operations can be executed in parallel. Several studies consider CRYPTON the most 
hardware-friendly AES candidate [9-11]. 

In this paper, hardware designs of CRYPTON Version 1.0 [6], which were 
optimized by exploiting the hardware-friendly features of CRYPTON, are presented. 
To maximize operation parallelism, key generation and data encryption operations are 
executed simultaneously. Some round loops can be unrolled for speedup without 
increasing the quantity of logic. S-boxes used for both key scheduling and data 
encryption are shared to minimize the area. Hardware architectures are described in 
VHDL and simulated using Synopsys Compiler and Simulator. Circuits are 
synthesized using a 0.35 µm gate array library, and timing and gate counts are 
measured. 

This paper is organized as follows. In Section 2, the CRYPTON algorithm is briefly 
introduced. In Section 3, our design considerations for CRYPTON hardware are 
described. In Sections 4 and 5, the detailed hardware designs of CRYPTON and the 
estimation results are presented, respectively. Concluding remarks are made in 
Section 6. 



2. CRYPTON Algorithm 

CRYPTON is an SPN (substitution-permutation network)-type cipher based on the 
structure of SQUARE [12]. CRYPTON represents each 128-bit data block as a 4x4 byte 
array and processes it using a sequence of round transformations. Each round 
transformation consists of four parallelizable steps: byte-wise substitution, 
column-wise bit permutation, column-to-row transposition, and then key addition. The 
encryption process involves 12 repetitions of the same round transformation. The 
decryption process can be made identical to the encryption process with a different 
key schedule. The high level structure of CRYPTON is shown in Figure 1. For details 
of the CRYPTON algorithm, please refer to [6]. 



3. Design Considerations of CRYPTON Hardware 



3.1 Parallel Execution of Key Generation and Data Encryption 

Some block ciphers pre-compute round keys and store them. The stored keys are then 
used repeatedly while encrypting. This style needs extra cycles for key setup, and 
large storage if all 12 round keys of 128-bit length are to be stored. However, 
CRYPTON can generate keys simultaneously with encryption. The time it takes to 
generate a round key of CRYPTON is insignificant compared to the time it takes for a 
round transformation, and this makes it possible to generate round keys while 
encryption proceeds. The parallel operation of hardware CRYPTON is illustrated in 
Figure 2. 

[Figure 1 (diagram not reproduced). Block labels, in order: 128-bit data input in 
4x4 byte array; initial key addition (round key K0); repeat 12 rounds: byte-wise 
S-box substitution, column-wise bit permutation, column-to-row byte transposition, 
key addition (round keys Ki, i = 1..12); output transformation: column-to-row byte 
transposition, column-wise bit permutation, column-to-row byte transposition; 
128-bit data output. A key schedule block derives the round keys from the 256-bit 
user key.] 

Fig. 1. The Structure of CRYPTON. 



[Figure 2 (diagram not reproduced). Block labels: User Key, Data Input, Data Output.] 

Fig. 2. Parallel Execution of Key Generation and Data Encryption. 
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3.2 Loop Unrolling 

Because CRYPTON consists of 12 repetitions of the round transformation, iteration is 
an inevitable choice for small area designs. Rather than building every round 
transformation separately, the data flow is made to pass through the same hardware 
block repeatedly. By exploiting iteration, we can build the whole encryption with only 
a small, basic building block. The block diagram for 12-cycle iteration is shown in 
Figure 3(a). 

Although iteration results in small area designs, it adds path delay from the 
multiplexer and register. To reduce the number of passes through the 
multiplexer and register, we can unroll the loop so that one cycle contains two or 
more copies of the round transformation logic. In Figure 3(b), two round 
transformations are performed in a cycle, but the design contains one multiplexer 
fewer than that of Figure 3(a). The design in Figure 3(b) therefore has better speed 
and area than the design in Figure 3(a). CRYPTON offers a good tradeoff between 
speed and area when 2, 3, 4, 6, or 12 rounds are concatenated in a loop. 
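
The unrolling factors listed above are the divisors of 12 greater than one, so that the loop completes in a whole number of cycles. The sketch below enumerates them; it counts only the cycles spent on round transformations and ignores any extra cycle a particular design may add. The function name is illustrative.

```python
ROUNDS = 12  # number of round transformations in CRYPTON encryption


def unroll_options(rounds=ROUNDS):
    """Return (rounds per cycle, cycles per block) for each even division."""
    return [(u, rounds // u) for u in range(1, rounds + 1) if rounds % u == 0]


for per_cycle, cycles in unroll_options():
    print(f"{per_cycle:2d} round(s) per cycle -> {cycles:2d} cycle(s) per block")
```

The first entry (one round per cycle, 12 cycles) is the plain iterative design; each larger factor trades more combinational logic for fewer passes through the multiplexer and register.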
[Figure 3 (diagram not reproduced). Panels (a) and (b); block labels: MUX, Odd Round 
Transformation, Even Round Transformation, Register, Loop Unrolling.] 

Fig. 3. Loop Unrolling. 



3.3 Common Logic Sharing 

CRYPTON uses only a few simple operations as its basic components. Because the 
components appear in several different parts of the algorithm, we can make those 
parts share a common logic block rather than building many separate, redundant logic 
blocks. Common logic sharing contributes to area reduction only if the amount of 
logic saved by sharing is larger than the multiplexer logic added. Otherwise, it 
only results in an increased control burden and longer path delay. CRYPTON has a 
byte substitution operation in both the key scheduling module and the encryption 
module. Because byte substitution occupies a significantly larger area than 128 2:1 
multiplexers do, there is a benefit in adopting common logic sharing, as shown in 
Figure 4. 
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Fig. 4. Common Logic Sharing 



4. Hardware Design of CRYPTON 

In this section, we propose two hardware designs of CRYPTON. 



4.1 Two-Round Model 

The two-round model performs two successive transformations, for even and odd 
rounds, within a clock cycle. The two-round model is efficient in area-speed tradeoff, 
as we saw in Section 3.2, and it has a smaller area than designs with 3, 4, 6, or 12 
rounds concatenated. 

The two-round model consists of three modules: an encryption module, a key 
scheduling module, and a control module, as shown in Figure 5. The following scheme 
describes how the two-round model works. 

1. Load the 128-bit input data at the “DATA_IN” port, and the 256-bit user key at the 
“USER_KEY” port. 

2. If decryption is to be performed, apply logic high on the “DECR” port until the 
operation is finished. 

3. Start encryption by applying a logic high pulse for one clock cycle at the “START” 
port. 

4. Check the output port “DONE” to see if the encryption is completed. 

5. If the “DONE” port outputs logic high, read the encryption result through the 
output port “DATA_OUT”. 

6. “Cycle0” is a signal indicating the key expansion cycle, and “Cycle1” indicates 
that the first of the 7 cycles is in progress. Both signals are used as control inputs 
for multiplexers. 
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The 12-round encryption process needs 13 round keys, from the round 0 key to the 
round 12 key. Generating 13 round keys will take 7 clock cycles because two-round 
model computes two keys within a clock cycle. Thus, the result is available 7 clock 
cycles later after logic high on “START” port is latched at the rising edge of clock. 
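
The START/DONE handshake and the 7-cycle latency can be illustrated with a small behavioural model. This is a sketch only: the class and attribute names are hypothetical, no data path is modelled, and the only facts taken from the text are the 13 round keys, two keys per clock cycle, and the resulting 7 cycles.

```python
import math

ROUND_KEYS = 13       # round 0 key through round 12 key
KEYS_PER_CYCLE = 2    # the two-round model computes two keys per clock


class TwoRoundModelSim:
    """Behavioural stand-in for the control timing of the two-round model."""

    CYCLES = math.ceil(ROUND_KEYS / KEYS_PER_CYCLE)   # 7

    def __init__(self):
        self.remaining = 0
        self.done = False

    def clock(self, start=False):
        """Model one rising clock edge."""
        if start:                      # START is latched on this edge
            self.remaining = self.CYCLES
            self.done = False
        elif self.remaining > 0:
            self.remaining -= 1
            self.done = self.remaining == 0


sim = TwoRoundModelSim()
sim.clock(start=True)                  # one-cycle pulse on START
cycles = 0
while not sim.done:                    # poll DONE
    sim.clock()
    cycles += 1
print(cycles)                          # number of cycles until DONE goes high
```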



[Figure 5 (diagram not reproduced). Port labels: DATA_IN, USER_KEY, CLK, nRESET, 
START, DECR.] 

Fig. 5. Top View of Two-Round Model 



The encryption module of the two-round model is shown in Figure 6(a). In the 
encryption module, two blocks for even round transformation and odd round 
transformation are cascaded serially, but the sequence of blocks in Figure 6(a) is 
somewhat different from that of Figure 1. Because round 0 is unique in performing 
only key addition, an extra 128 two-input exclusive-OR gates would be required if we 
did not adopt the sequence shown in Figure 6(a). 

The most unwieldy part of CRYPTON might be the S-box byte substitution. The 
byte substitution logic has 16 256-entry tables. As Figure 6(a) shows, the encryption 
module must have two byte substitution blocks because the substitution is distinct 
for odd rounds and even rounds. In addition, the key scheduling module has two byte 
substitution blocks for user key expansion, which is used only once when a new user 
key is set. Since most of the other operations take comparatively small area, 
incorporating separate byte substitutions for key expansion is disproportionate to 
its rare use. This lack of balance can be corrected by sharing byte substitution 
blocks between the expansion block of the key scheduling module and the round 
transformation block of the encryption module. This scheme effectively reduces the 
total area but needs one more clock cycle, only for key expansion, when new expanded 
keys are to be computed. Figure 6(b) shows the new block diagram for the encryption 
module with sharing of byte substitution blocks. In Figure 6(b), outputs “Ur” and 
“Vr” are fed to the inputs of two 128-bit registers and stored in them. The stored 
values in the registers are used repeatedly for generation of round keys unless a new 
user key is set. The key scheduling area also has a change in its expansion block. 
The key scheduling area takes Ur and Vr [6], the outputs of the expanded key 
registers, as inputs, and performs the operations that follow the byte substitution 
blocks. 



[Figure 6 (diagram not reproduced). Panels (a) and (b); labels: 128-bit Input Data, 
128-bit Output Data.] 

Fig. 6. Encryption Module of Two-Round Model 



The block diagram for the key scheduling module is depicted in Figure 7. The key 
scheduling module conveys two successive round keys to the encryption module at 
every clock cycle. One of our design principles is the parallel execution mentioned 
in Section 3.1, and we do not adopt any round-key pre-computation style for speed 
enhancement. 

There is a design concern in the final stage of round-key computation. A round key 
is the exclusive-OR sum of a round constant, masking constants, and expanded keys. 
There are only 4 masking constants, which can easily be built into combinational 
logic, and the expanded keys are unknown before the round, so we have no design 
choices about them. But there are as many as 13 round constants of 32-bit length, and 
the designer must decide whether to build wired logic out of pre-computed round 
constants, or make the design compute the constants from the specified equation at 
each clock cycle. Computation of constants in a fully parallelized design like 
Figure 8 needs at least four 32-bit adders and two 32-bit registers. On the other 
hand, using wired logic for the 13 constants introduces a slightly longer propagation 
delay in key computation. Because the two-round model aims at small area, 
implementation with wired logic is chosen. 
[Figure 7 (diagram not reproduced). Output labels: Even Round Key, Odd Round Key.] 

Fig. 7. Key Scheduling Module of Two-Round Model 



4.2 Full-Round Model 

To make a fast design, pipelining or loop unrolling can generally be applied. In 
pipelining, the algorithm is partitioned into several stages, and this enables several 
data blocks to be encrypted simultaneously. This is possible in ECB mode because 
each result of block encryption is independent of the others. However, modes other 
than ECB require the previous result of encryption to be available to complete the 
present encryption, which makes pipelining useless. Since most recent applications 
of block ciphers use a chaining or feedback mode, speed enhancement through 
pipelining is not considered here. Instead, we make a full-round model with 12 rounds 
fully unrolled. This full-round model computes 12 rounds of transformation without 
looping, and it is the fastest but the largest design among those exploiting loop 
unrolling. Loop unrolling of 4 or 6 rounds will have just intermediate values of area 
and speed between those of the two-round model and the full-round model. Block 
diagrams for the full-round model are straightforward and are shown in Figure 8, 
Figure 9, and Figure 10. 

In the key scheduling module shown in Figure 10, the round-key output for the i-th 
round will be $Ke_i$ if encryption is performed, and $Kd_i$ if decryption is performed. 
To achieve high speed, common logic sharing in the expansion block is not adopted 
here. The full-round model tells us the area cost of the nearly pure algorithm because 
it does not need any registers or control unit. However, the 12 multiplexers in the 
key scheduling area could not be removed. 
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Fig. 8. Top View of Full-Round Model 
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Fig. 9. Encryption Module of Full-Round Model 



5. Estimation Results 

Our estimations are all based on Synopsys Design Compiler and the Hyundai 0.35 µm 
gate array library. Table 1 shows the area and speed of the two-round model and the 
full-round model, each with two speed-area tradeoffs. In this paper, all estimations 
are for typical cases. “Gate” means a 2-input NAND gate, equivalent to 4 transistors. 
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Fig. 10. Key Scheduling Module of Full-Round Model 



Table 1. Speed and Area Estimations of Hardware CRYPTON. 

Estimated items               Two-round model                 Full-round model
                              Area critical   Speed critical  Area critical   Speed critical
Gate count (cell area)
  total gates                 18,322          28,179          46,259          93,929
  encryption module           8,267           17,958          33,598          74,857
  key scheduling module       7,930           8,078           12,661          19,072
Minimum clock period (T)      18.97 ns        10.23 ns        74.03 ns        44.30 ns
Key setup time                on the fly      on the fly      10.13 ns        7.91 ns
Time to switch keys           18.97 ns        10.23 ns        10.13 ns        6.13 ns
Time to encrypt one block     132.79 ns       71.61 ns        74.03 ns        44.30 ns
                              (7 x T)         (7 x T)
Throughput                    898 Mbps ¹      1.66 Gbps ²     1.61 Gbps       2.69 Gbps

¹ 1 Mbps = 1,024 x 1,024 bps = 1,048,576 bps 
² 1 Gbps = 1,024 x 1,024 x 1,024 bps = 1,073,741,824 bps 
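
The throughput row can be recomputed from the block times, as a cross-check. The sketch below assumes throughput is simply 128 bits divided by the time to encrypt one block, expressed in the binary Gbps unit of the footnote; the dictionary keys and function name are illustrative.

```python
BLOCK_BITS = 128
BINARY_GBPS = 1024 ** 3   # 1 Gbps as defined in the table footnote

# Time to encrypt one block, in nanoseconds, from Table 1.
block_time_ns = {
    "two-round, area critical": 7 * 18.97,
    "two-round, speed critical": 7 * 10.23,
    "full-round, area critical": 74.03,
    "full-round, speed critical": 44.30,
}


def throughput_gbps(time_ns):
    """Bits per second for one 128-bit block per time_ns, in binary Gbps."""
    return BLOCK_BITS / (time_ns * 1e-9) / BINARY_GBPS


for name, t in block_time_ns.items():
    print(f"{name}: {throughput_gbps(t):.3f} Gbps")
```

Under this assumption the three Gbps entries are reproduced to two decimal places. The first entry comes out at about 0.898 in the same binary Gbps unit, which appears to be how the 898 Mbps figure was obtained.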
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The two results for two-round model have identical logic design, but are different 
only in optimization strategies. The one indicated as “Area critical” was optimized to 
reduce area as much as possible. On the other hand, the “Speed critical” result was 
obtained by minimizing the clock period. As shown under “Time to switch keys”, one 
clock cycle is spent on expanding the user key when the user key is switched to a new 
value. Since the design uses one clock source for its whole area, the time to switch 
keys will be at least the minimum clock period, even though the actual propagation 
delay in key expansion is less than the minimum clock period. Round keys are 
generated on the fly, and if the user key remains the same, 7 clock cycles are needed 
for both encryption and decryption. 

The same optimization strategies were applied to full-round model. However, in 
speed critical optimization, the path from data input to the final output was optimized 
rather than minimizing the clock period as in two-round model. 

Although the results for the full-round model are presented in the same table as 
those for the two-round model, the numbers should not be compared directly because 
the two designs have different architectures. The following three explanations 
qualify the meaning of each item for the full-round model. 

• Time to switch keys: time from the insertion of a new user key to the generation of 
the round 0 key (this is because the time taken to compute the next round key is much 
shorter than the time to compute one round transformation. By the time the operation 
of round [...] has been performed, all round keys from round [...] to the 12th round 
are available). 

• Key setup time: time needed to compute all 12 round keys. 

• Time to encrypt one block: the longest path delay from data input to the final data 
output (it is the time taken to encrypt one block when all round keys are set up and 
ready to be used). 

In the two-round model, “total gates” is larger than the sum of “encryption module” 
and “key scheduling module” because that sum omits the area of the buffer for 
expanded keys and of the control unit. The full-round model does not have any extra 
logic beyond the encryption module and key scheduling module, and thus the sum 
matches exactly. 

Two main issues in logic design are speed and area. Because speed and area trade 
off against each other in most cases, we can get the best result on one criterion by 
optimizing for it while ignoring the other; the result is then the worst one for the 
ignored criterion. We can find the approximate upper and lower bounds of the speed 
and area of CRYPTON by optimizing once for area and once for speed. 

The main trade-off between space and time takes place in the optimization of the 
S-box. We could obtain a speed as high as 2.6 Gbps by growing the area of the S-box, 
but the total number of gates was over 90,000. By exploiting the speed-area tradeoff, 
a two-round model faster than a full-round model was possible, contrary to our 
initial scheme of using loop unrolling to make a fast design. In our estimation, the 
two-round model was found to be more moderate and practical in both area and speed 
than the full-round model. Although optimization in the synthesis tool was very 
useful for achieving a goal in speed or area, a better result was possible by 
modifying the design itself. 
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6. Conclusions 

In this paper, we designed and proposed two hardware architectures of CRYPTON 
exploiting the inherent hardware-friendly features of CRYPTON. The architectures 
were described in VHDL and circuits were synthesized using a 0.35 µm gate array with 
several speed-area tradeoffs. A speed of 0.9 Gbps with the smallest area of 18,000 
gates, and the fastest speed of 2.6 Gbps with less than 100,000 gates, could be 
achieved. The speed of 2.6 Gbps is an order of magnitude faster than the fastest 
commercially available Triple-DES chip. This is enough speed to support Gigabit 
networks. Since CRYPTON has good scalability in gate count, a designer can select a 
proper speed-area tradeoff from a large set of choices. 
Abstract. We propose a new fast implementation method of public-key 
cryptography suitable for DSP. We improved modular multiplication and 
elliptic doubling to increase speed. For modular multiplication, we devised 
a new implementation method of Montgomery multiplication, which 
is suitable for pipeline processing. For elliptic doubling, we devised an 
improved computation that reduces the number of multiplications and additions. 
We implemented RSA, DSA and ECDSA on the latest DSP (TMS320C6201, 
Texas Instruments), and achieved a performance of 11.7 msec for 1024-bit 
RSA signing, 14.5 msec for 1024-bit DSA verification and 3.97 msec 
for 160-bit ECDSA verification. 

1 Introduction 

Public-key cryptography is an important encryption technique. It can be applied 
to many practical uses such as electronic commerce systems and WWW systems 
for enabling digital signatures and key agreement. The server systems for them 
are required to process a vast number of public key operations. 

Additionally, for communicating with various kinds of clients, the server 
systems are required to provide various public-key cryptography functions, such as 
RSA [15], Diffie-Hellman key agreement [5], DSA [16] and elliptic curve 
cryptography (ECC) [9] [12]. These functions are under standardization in IEEE P1363 
[17]. 

In this paper, we describe a fast implementation method using DSP as a 
cryptographic engine for server systems. In public-key cryptography, modular 
multiplications are the most time-consuming operations. A DSP can compute 
these operations efficiently with a fast hardware multiplier. Furthermore, a DSP 
can be used as the hardware engine for various algorithms since it is 
programmable. 

In the past, fast public key cryptographic implementations on DSPs have 
been reported [1][2][6]. They concentrated on the implementation of RSA using 
the latest DSP at the time. We implemented RSA, DSA and ECDSA over prime 
fields based on the IEEE P1363 draft, and propose new implementation methods 
suitable for DSP. Our methods concern modular multiplication and elliptic 
doubling. 

For modular multiplication, we devised a fast implementation method for 
Montgomery multiplication [14]. Our method is suitable for pipeline processing. 

For elliptic doubling, we devised a new method which reduces the number 
of multiplications and additions in comparison with that specified in the IEEE 
P1363 draft. In general, the running time of addition is considered negligible 
compared with that of multiplication. But in fact, the running time of addition is 
not negligible on a processor such as a DSP, which has a fast hardware multiplier. 

There are some reports concerning the fast implementation of ECC [3][4][13].
They used special elliptic curve domain parameters (EC domain parameters)
to gain speed. In contrast, our implementation can use any EC domain
parameters, as needed for server systems: a server requires high performance
and must communicate with client systems that use various types of EC domain
parameters.

We implemented public-key cryptography functions with our method on the 
latest DSP TMS320C6201 (Texas Instruments). This DSP can operate eight 
function units in parallel and has a performance of 1600 MIPS at 200 MHz. 
The performance achieved in our implementation was 11.7 msec for 1024-bit 
RSA signing, 14.5 msec for 1024-bit DSA verification and 3.97 msec for 160-bit 
ECDSA verification. 

We describe our improvement method for Montgomery multiplication in
section 2, our elliptic doubling method in section 3 and the performance in
section 4.

2 Fast Implementation Method of Montgomery 
Multiplication 

2.1 Montgomery Multiplication 

Basic algorithm. Set N > 1. Select a radix R co-prime to N such that R > N
and such that computations modulo R are inexpensive to process. Let N' be
the integer satisfying 0 < N' < R and N' = -N^{-1} (mod R). For all integers
A and B satisfying 0 ≤ AB < RN, we can compute REDC(A, B) = ABR^{-1}
(mod N) with Algorithm 1.

Algorithm 1. Montgomery multiplication algorithm REDC.

input : A, B, R, N.
output : Y = ABR^{-1} (mod N).

101 N' := -N^{-1} (mod R)
102 T := AB
103 M := (T (mod R)) N' (mod R)
104 T := T + MN
105 T := T/R
106 if T ≥ N then return T - N else return T
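The steps of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: R is taken to be 2^r_bits, Python's arbitrary-precision integers stand in for multi-precision arithmetic, and the name `redc` and the toy values N = 101, R = 256 are chosen here, not taken from the paper.

```python
def redc(a, b, n, r_bits):
    """Return a*b*R^{-1} mod n for R = 2**r_bits (n odd, n < R, a*b < R*n)."""
    r = 1 << r_bits
    mask = r - 1
    n_prime = (-pow(n, -1, r)) % r       # line 101: N' = -N^{-1} (mod R)
    t = a * b                            # line 102
    m = ((t & mask) * n_prime) & mask    # line 103: "mod R" is a bit mask
    t = (t + m * n) >> r_bits            # lines 104-105: exact division by R
    return t - n if t >= n else t        # line 106


n, r_bits = 101, 8                       # toy values: odd N, R = 256 > N
a, b = 57, 83
expected = a * b * pow(1 << r_bits, -1, n) % n
result = redc(a, b, n, r_bits)
print(result == expected)
```

Because R is a power of 2, the reduction modulo R and the division by R become a mask and a shift, which is the property the text relies on.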

If R is a power of 2, line 105 can be computed fast with shift operations.

Modular multiplication with Montgomery method. Since REDC(A, B)
= ABR^{-1} (mod N), it cannot compute modular multiplication directly. But
on reviewing REDC(AR, BR) = ABR (mod N), it can be seen that REDC
can compute modular multiplication by converting A (mod N) to AR (mod N).
After this conversion, a series of modular multiplications can be computed fast
with REDC. For example, we show an m-ary exponentiation [7] with REDC
in Algorithm 2, where e is a k-bit exponent and e_i is an m-bit integer which
satisfies e = Σ (2^m)^i e_i.

Algorithm 2. m-ary exponentiation method with REDC.

input : A, e, N, R
output : Y = A^e (mod N)

201 A' := A x R (mod N)
202 T[0] := 1 x R (mod N)
203 for i := 1 to 2^m - 1
204   T[i] := REDC(T[i-1], A')
205 next i
206 Y := 1 x R (mod N)
207 for i := ⌈k/m⌉ - 1 down to 0
208   for j := 1 to m
209     Y := REDC(Y, Y)
210   next j
211   Y := REDC(Y, T[e_i])
212 next i
213 Y := Y x R^{-1} (mod N)
214 return Y
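Algorithm 2 can be sketched as follows. The sketch assumes R = 2^r_bits and uses Python integers in place of multi-precision words; the helper names `redc` and `mary_pow`, the modulus 1000003 and the choice of ⌈k/m⌉ digits are illustrative assumptions. Line 213 is carried out as REDC(Y, 1), which is one way to multiply by R^{-1}.

```python
def redc(a, b, n, r_bits):
    """Montgomery product a*b*R^{-1} mod n for R = 2**r_bits (n odd, n < R)."""
    r = 1 << r_bits
    n_prime = (-pow(n, -1, r)) % r
    t = a * b
    m = ((t & (r - 1)) * n_prime) & (r - 1)
    t = (t + m * n) >> r_bits
    return t - n if t >= n else t


def mary_pow(a, e, n, r_bits, m=4):
    """Compute a**e mod n with m-bit exponent digits, following Algorithm 2."""
    r = 1 << r_bits
    a_mont = a * r % n                          # line 201
    table = [r % n]                             # line 202: T[0] is 1 in Montgomery form
    for i in range(1, 1 << m):                  # lines 203-205
        table.append(redc(table[i - 1], a_mont, n, r_bits))
    y = r % n                                   # line 206
    digits = max(1, -(-e.bit_length() // m))    # ceil(k/m) digits of m bits each
    for i in range(digits - 1, -1, -1):         # line 207
        for _ in range(m):                      # lines 208-210: m squarings
            y = redc(y, y, n, r_bits)
        e_i = (e >> (m * i)) & ((1 << m) - 1)
        y = redc(y, table[e_i], n, r_bits)      # line 211
    return redc(y, 1, n, r_bits)                # line 213: leave Montgomery form


n, r_bits = 1000003, 20                         # odd modulus below R = 2**20
print(mary_pow(12345, 65537, n, r_bits) == pow(12345, 65537, n))
```

Only the conversions at lines 201, 202, 206 and 213 involve R explicitly; every multiplication in between is a REDC call.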



REDC routine with single-precision. To implement REDC on general
processors, multi-precision computation must be divided into iterations of
single-precision computation. In [10], many types of REDC routines are constructed
with single-precision computation. Algorithm 3 shows a Finely Integrated
Operand Scanning (FIOS) type of REDC routine from [10].

We will use the following notations. Capital variables such as A or N mean
a multi-precision integer. Small letter variables such as a_i, b_j or tmp1 mean a
single-precision integer of w-bit length.

A multi-precision integer, for example A, is expressed as the series of
single-precision variables (a_{g-1}, a_{g-2}, ..., a_0). An expression such as (a, b) means the
concatenation of single-precision variables a and b. We also use expressions
such as (A, b), which means the concatenation of a multi-precision variable A
and a single-precision variable b.

In Algorithm 3, the block-shift is executed by reading from y_i and writing to
y_{i-1}. Note that the w-bit variables tmp3 and c_1 hold only a 1-bit value.

Algorithm 3. REDC routine with single-precision computation. (FIOS [10].)
input: A = (a_{g-1}, a_{g-2}, ..., a_0), B = (b_{g-1}, b_{g-2}, ..., b_0), N = (n_{g-1}, ..., n_0),
       n'_0 = -n_0^{-1} (mod 2^w); R = (2^w)^g
output: Y = (y_g, y_{g-1}, ..., y_0) = ABR^{-1} (mod N).

301 Y := 0
302 for j := 0 to g - 1
303   (tmp2, tmp1) := y_0 + a_0 x b_j
304   m := tmp1 x n'_0 (mod 2^w)
305   (tmp4, tmp1) := tmp1 + m x n_0
306   (c_1, c_0) := tmp2 + tmp4
307   for i := 1 to g - 1
308     (tmp3, tmp2, tmp1) := y_i + (c_1, c_0) + a_i x b_j    single-precision multiplication
309     (tmp4, y_{i-1}) := tmp1 + m x n_i                     single-precision reduction
310     (c_1, c_0) := tmp4 + (tmp3, tmp2)                     carry computation
311   next i
312   (c_1, c_0) := (c_1, c_0) + y_g
313   y_{g-1} := c_0
314   y_g := c_1
315 next j
316 if Y ≥ N then Y := Y - N
317 return Y



2.2 Proposed Method 

To speed up Algorithm 3 on a DSP, let us consider making the core loop in
lines 308-310 suitable for pipelining. For the improvement, we considered the
following problems:

(1) Single-precision multiplication in line 308 cannot execute until
single-precision reduction in line 309 and carry computation in line 310 finish.

(2) The contents of the computation differ among single-precision
multiplication, single-precision reduction and carry computation.

(3) The result of carry computation, (c_1, c_0) in line 310, has a (w + 1)-bit
value, so it must be processed as a multi-precision variable.

We reviewed the computation to solve these problems. Figure 1 shows the
construction of the core loop. On reviewing the carry processing in Fig. 1, the carry of
the single-precision multiplication and the carry of the single-precision reduction are
added to C = (c_1, c_0), and C is input to the single-precision multiplication
in the next loop iteration. To review this processing, we combine the computation in
the core loop as follows:

    (C, y_{i-1}) := y_i + C + a_i x b_j + m x n_i

From this equation, we can divide the carry C into the carry c_1 for a_i x b_j
and the carry c_2 for m x n_i as follows:

    (c_1, tmp1) := y_i + c_1 + a_i x b_j        single-precision multiplication
    (c_2, y_{i-1}) := tmp1 + c_2 + m x n_i      single-precision reduction

Fig. 1. Construction of core loop in Algorithm 3.



From these equations, we can see that problems (1), (2) and (3) are solved as
follows:

Problem (1) is solved because the carries c_1 and c_2 each feed back only to themselves,
which enables single-precision multiplication to start computing without waiting
until single-precision reduction finishes. Problem (2) is solved because
single-precision multiplication and single-precision reduction now perform
the same computation. Problem (3) is solved because the right-hand sides of these
equations never exceed 2^{2w} - 1, even if all single-precision variables on the
right-hand side are 2^w - 1, since (2^w - 1) + (2^w - 1) + (2^w - 1)^2 = 2^{2w} - 1;
the lengths of c_1 and c_2 therefore do not exceed w bits.

Algorithm 4 shows an improved routine of Algorithm 3. Figure 2 shows the
construction of the core loop in Algorithm 4.

Algorithm 4. Proposed Montgomery multiplication algorithm.
input: A = (a_{g-1}, a_{g-2}, ..., a_0), B = (b_{g-1}, b_{g-2}, ..., b_0), N = (n_{g-1}, ..., n_0),
       n'_0 = -n_0^{-1} (mod 2^w); R = (2^w)^g
output: Y = (y_g, y_{g-1}, ..., y_0) = ABR^{-1} (mod N).

401 Y := 0
402 for j := 0 to g - 1
403   (c_1, tmp1) := y_0 + a_0 x b_j
404   m := tmp1 x n'_0 (mod 2^w)
405   (c_2, tmp1) := tmp1 + m x n_0
406   for i := 1 to g - 1
407     (c_1, tmp1) := y_i + c_1 + a_i x b_j        single-precision multiplication
408     (c_2, y_{i-1}) := tmp1 + c_2 + m x n_i      single-precision reduction
409   next i
410   (c_2, c_1) := c_1 + c_2 + y_g
411   y_{g-1} := c_1
412   y_g := c_2
413 next j
414 if Y ≥ N then Y := Y - N
415 return Y

Fig. 2. Construction of the core loop in Algorithm 4.
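The word-level routine of Algorithm 4 can be simulated as below. This is a sketch, not the authors' DSP code: it assumes w = 16-bit words, takes n'_0 = -n_0^{-1} (mod 2^w), and the function name `mont_mul_words` and the 64-bit test operands are made up for illustration. The inner assertion checks the claim that both carries stay within one word.

```python
def mont_mul_words(a, b, n, g, w=16):
    """Word-level Montgomery product a*b*R^{-1} mod n with R = 2**(w*g), n odd."""
    mask = (1 << w) - 1

    def words(x):
        return [(x >> (w * i)) & mask for i in range(g)]

    aw, bw, nw = words(a), words(b), words(n)
    n0_prime = (-pow(nw[0], -1, 1 << w)) & mask     # n'_0 = -n_0^{-1} mod 2^w
    y = [0] * (g + 1)                               # line 401
    for j in range(g):                              # line 402
        t = y[0] + aw[0] * bw[j]                    # line 403
        c1, tmp1 = t >> w, t & mask
        m = (tmp1 * n0_prime) & mask                # line 404
        t = tmp1 + m * nw[0]                        # line 405: low word becomes 0
        c2 = t >> w
        for i in range(1, g):                       # line 406
            t = y[i] + c1 + aw[i] * bw[j]           # line 407: multiplication
            c1, tmp1 = t >> w, t & mask
            t = tmp1 + c2 + m * nw[i]               # line 408: reduction
            c2, y[i - 1] = t >> w, t & mask
            assert c1 <= mask and c2 <= mask        # each carry fits in w bits
        t = c1 + c2 + y[g]                          # line 410
        y[g - 1], y[g] = t & mask, t >> w           # lines 411-412
    result = sum(word << (w * i) for i, word in enumerate(y))
    return result - n if result >= n else result    # line 414


n = 0xF123456789ABCDEF                              # odd 64-bit modulus, g = 4 words
a = 0x0123456789ABCDEF
b = 0xFEDCBA9876543210 % n
print(mont_mul_words(a, b, n, 4) == a * b * pow(1 << 64, -1, n) % n)
```

Lines 407 and 408 have the same shape (word + carry + product), which is what lets the two steps share one pipelined instruction pattern.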



3 Fast Elliptic Doubling 

We used a Weierstrass equation, y^2 = x^3 + ax + b (mod p), for the elliptic curve
over prime fields where 4a^3 + 27b^2 ≠ 0 (mod p), and projective coordinates
(X, Y, Z) which satisfy (x, y) = (X/Z^2, Y/Z^3).

For exponentiation, such as the m-ary [7] or window method [7], m elliptic
doublings and 1 elliptic addition are processed alternately. Focusing on this point,
the m-repeated elliptic doublings method proposed in [8] concerns
computation in affine coordinates over binary fields. Compared
to m separate elliptic doublings, this method reduces the number of inversions by
computing 2^m P for P = (x, y) directly, without computing the intermediate points
2^i P (1 ≤ i ≤ m - 1).

We also focus on m-repeated elliptic doublings, but take another
approach to decrease the amount of computation, in terms of projective
coordinates over prime fields. Our method is based on the m times elliptic doublings
specified in the IEEE P1363 draft [17], and reduces the number of both additions
and multiplications.

3.1 Reducing the Number of Multiplications 

In this section, we describe our m-repeated elliptic doublings method, which
requires fewer multiplications than the m times elliptic doublings specified in
the IEEE P1363 draft. In our method, a temporary value used in the i-th
elliptic doubling is reused in the (i + 1)-th elliptic doubling, and this eliminates
2 multiplications. Therefore, our method requires 10 multiplications in the first
elliptic doubling, but only 8 multiplications in each doubling from the second
to the m-th. Let (X_m, Y_m, Z_m) = 2^m (X_0, Y_0, Z_0). Algorithm 5 shows the m times
elliptic doublings specified in the IEEE P1363 draft [17].

Algorithm 5. m times elliptic doublings specified in the IEEE P1363 draft.
input: Elliptic curve point (X_0, Y_0, Z_0), m and EC domain parameter a.
output: Elliptic curve point (X_m, Y_m, Z_m) = 2^m (X_0, Y_0, Z_0).

501 for i := 0 to m - 1
502   W_i := aZ_i^4
503   M_i := 3X_i^2 + aZ_i^4
504   S_i := 4X_iY_i^2
505   T_i := 8Y_i^4
506   X_{i+1} := M_i^2 - 2S_i
507   Y_{i+1} := M_i(S_i - X_{i+1}) - T_i
508   Z_{i+1} := 2Y_iZ_i
509 next i

If we consider W_i = aZ_i^4 and Z_{i+1} = 2Y_iZ_i in lines 502 and 508, we notice that
W_i = a(2Y_{i-1}Z_{i-1})^4 = 2(8Y_{i-1}^4)(aZ_{i-1}^4), so W_i can be computed as
W_i = 2T_{i-1}W_{i-1}, which eliminates 2 multiplications. We
show the improved routine of Algorithm 5 in Algorithm 6.

Algorithm 6. Improved routine of Algorithm 5.
input: Elliptic curve point (X_0, Y_0, Z_0), m and EC domain parameter a.
output: Elliptic curve point (X_m, Y_m, Z_m) = 2^m (X_0, Y_0, Z_0).

601 W_0 := aZ_0^4
602 M_0 := 3X_0^2 + W_0
603 S_0 := 4X_0Y_0^2
604 T_0 := 8Y_0^4
605 X_1 := M_0^2 - 2S_0
606 Y_1 := M_0(S_0 - X_1) - T_0
607 Z_1 := 2Y_0Z_0
608 for i := 1 to m - 1
609   W_i := 2T_{i-1}W_{i-1}
610   M_i := 3X_i^2 + W_i
611   S_i := 4X_iY_i^2
612   T_i := 8Y_i^4
613   X_{i+1} := M_i^2 - 2S_i
614   Y_{i+1} := M_i(S_i - X_{i+1}) - T_i
615   Z_{i+1} := 2Y_iZ_i
616 next i
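The reuse of W can be checked numerically with a small sketch. The toy curve y^2 = x^3 + 2x + 3 over p = 97 with the point (3, 6), and the function names, are assumptions made here for illustration; they do not come from the paper. Both routines should return identical coordinates, and the result should still satisfy the curve equation in projective form.

```python
def doublings_p1363(X, Y, Z, m, a, p):
    """Algorithm 5: every doubling recomputes W_i = a * Z_i^4 from scratch."""
    for _ in range(m):
        W = a * pow(Z, 4, p) % p
        M = (3 * X * X + W) % p
        S = 4 * X * Y * Y % p
        T = 8 * pow(Y, 4, p) % p
        X_next = (M * M - 2 * S) % p
        Y_next = (M * (S - X_next) - T) % p
        Z = 2 * Y * Z % p                  # uses the old Y, as in line 508
        X, Y = X_next, Y_next
    return X, Y, Z


def doublings_reuse_w(X, Y, Z, m, a, p):
    """Algorithm 6: W_i = 2 * T_{i-1} * W_{i-1} from the second doubling on."""
    W = a * pow(Z, 4, p) % p               # line 601, the only a * Z^4 evaluation
    T = 0                                  # set by the first doubling before use
    for i in range(m):
        if i > 0:
            W = 2 * T * W % p              # line 609
        M = (3 * X * X + W) % p
        S = 4 * X * Y * Y % p
        T = 8 * pow(Y, 4, p) % p
        X_next = (M * M - 2 * S) % p
        Y_next = (M * (S - X_next) - T) % p
        Z = 2 * Y * Z % p
        X, Y = X_next, Y_next
    return X, Y, Z


def on_curve(X, Y, Z, a, b, p):
    """Projective form of y^2 = x^3 + a*x + b with (x, y) = (X/Z^2, Y/Z^3)."""
    return (Y * Y - (X**3 + a * X * pow(Z, 4, p) + b * pow(Z, 6, p))) % p == 0


p, a, b = 97, 2, 3                         # toy parameters, not from the paper
point = (3, 6, 1)                          # 6^2 = 36 = 3^3 + 2*3 + 3
print(doublings_reuse_w(*point, 4, a, p) == doublings_p1363(*point, 4, a, p))
```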




3.2 Reducing the Number of Additions



Generally, an addition is regarded as much faster than a multiplication, and its 
running time is not considered. But on a DSP, multiplication can be computed 
efficiently with a fast hardware multiplier, and the running time of addition 
is not negligible. Table 1 shows a comparison of the running time of a modular 
multiplication and a modular addition based on our implementation on the DSP. 

Table 1. Comparison of the running time of a modular multiplication and a
modular addition @ 200 MHz.

                  160-bit       192-bit       239-bit
Multiplication    1.36 μsec     1.76 μsec     2.68 μsec
Addition          0.250 μsec    0.254 μsec    0.291 μsec



In projective elliptic doubling, some computations such as modular
multiplication by 2, 3, 4, and 8 can be implemented by a combination of modular
addition(s) and subtraction(s). Adding modular multiplication by 1/2 to these
computations, we refer to all of them as “addition” in this paper. We estimate the
computation amount of “addition” as follows:

— Modular addition and subtraction are “1 addition” . 

— Modular multiplication by 2 and 1/2 are “1 addition”. 

— Modular multiplication by 3 and 4 are “2 additions” . 

— Modular multiplication by 8 is “3 additions” . 

Now we consider reducing the number of additions in Algorithm 6 with this
estimate. For example, computing 4Y^2 as (2Y)^2 eliminates 1 addition compared
with computing it as 4 x (Y^2). Thus, additions in Algorithm 6 are reduced
with 2Y-based computation. We can reduce the number of
additions in Algorithm 6 by the following techniques:

(A) At the beginning, compute Y_0' = 2Y_0 as a base value, and compute Y_i' (= 2Y_i)
without computing Y_i for i < m.

(B) By reason of (A), compute T_i = 16Y_i^4 instead of 8Y_i^4.

(C) Compute S_i = 4X_iY_i^2, Z_i = 2Z_{i-1}Y_{i-1} and T_i = 16Y_i^4 based on Y_i' = 2Y_i,
viz. compute S_i = X_i(Y_i')^2, Z_i = Z_{i-1}(Y_{i-1}') and T_i = (Y_i')^4 respectively.

(D) Finally, compute Y_m = Y_m'/2.

We show the improved routine of Algorithm 6 in Algorithm 7. 

Algorithm 7. Proposed m-repeated elliptic doublings routine.
input: Elliptic curve point (X_0, Y_0, Z_0), m and EC domain parameter a.
output: Elliptic curve point (X_m, Y_m, Z_m) = 2^m (X_0, Y_0, Z_0).

701 Y_0' := 2Y_0
702 W_0 := aZ_0^4
703 M_0 := 3X_0^2 + W_0
704 S_0 := X_0(Y_0')^2
705 T_0 := (Y_0')^4
706 X_1 := M_0^2 - 2S_0
707 Y_1' := 2M_0(S_0 - X_1) - T_0
708 Z_1 := Y_0'Z_0
709 for i := 1 to m - 1
710   W_i := T_{i-1}W_{i-1}
711   M_i := 3X_i^2 + W_i
712   S_i := X_i(Y_i')^2
713   T_i := (Y_i')^4
714   X_{i+1} := M_i^2 - 2S_i
715   Y_{i+1}' := 2M_i(S_i - X_{i+1}) - T_i
716   Z_{i+1} := Y_i'Z_i
717 next i
718 Y_m := Y_m'/2
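A sketch of Algorithm 7 follows, checked against plain repeated doubling. The toy curve (a = 2, p = 97, point (3, 6)), the function names and the halving helper are illustrative assumptions; halving modulo an odd p is done with one conditional addition and a shift, which matches the paper's convention of counting multiplication by 1/2 as an “addition”.

```python
def half_mod(v, p):
    """Return v/2 mod odd p for 0 <= v < p, using an add and a shift."""
    return v >> 1 if v % 2 == 0 else (v + p) >> 1


def doublings_reference(X, Y, Z, m, a, p):
    """m independent doublings (Algorithm 5), used only as a cross-check."""
    for _ in range(m):
        M = (3 * X * X + a * pow(Z, 4, p)) % p
        S = 4 * X * Y * Y % p
        T = 8 * pow(Y, 4, p) % p
        X_next = (M * M - 2 * S) % p
        Y, Z = (M * (S - X_next) - T) % p, 2 * Y * Z % p
        X = X_next
    return X, Y, Z


def doublings_proposed(X, Y, Z, m, a, p):
    """Algorithm 7 for m >= 1: carry Y' = 2Y through the loop, halve once at the end."""
    Yd = 2 * Y % p                         # line 701
    W = a * pow(Z, 4, p) % p               # line 702
    T = 0                                  # set by the first doubling before use
    for i in range(m):
        if i > 0:
            W = T * W % p                  # line 710: T already holds 16*Y^4
        M = (3 * X * X + W) % p
        Yd2 = Yd * Yd % p
        S = X * Yd2 % p                    # 4*X*Y^2 without a multiplication by 4
        T = Yd2 * Yd2 % p                  # 16*Y^4 without a multiplication by 8
        X_next = (M * M - 2 * S) % p
        Z = Yd * Z % p                     # 2*Y*Z, using the old Y'
        Yd = (2 * M * (S - X_next) - T) % p
        X = X_next
    return X, half_mod(Yd, p), Z           # line 718


p, a = 97, 2                               # toy parameters, not from the paper
point = (3, 6, 1)                          # lies on y^2 = x^3 + 2x + 3 mod 97
print(doublings_proposed(*point, 5, a, p) == doublings_reference(*point, 5, a, p))
```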

Table 2 shows the number of multiplications and additions required for the above
algorithms. Our method eliminates 2m - 2 multiplications and 5m - 2 additions
compared with the m times elliptic doublings specified in the IEEE P1363 draft.



Table 2. Number of multiplications and additions.

m-repeated elliptic doublings      Multiplication    Addition
Algorithm 5 (IEEE P1363 draft)     10m               13m
Algorithm 6                        8m + 2            14m - 1
Algorithm 7 (Proposed)             8m + 2            8m + 2
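To see what these counts mean in time, the operation counts of Table 2 can be combined with the per-operation timings of Table 1. The sketch below is a rough estimate only: it assumes the running time is the sum of the modular multiplications and “additions” and ignores loop overhead and memory traffic; the function and dictionary names are made up here.

```python
# Per-operation times in microseconds, copied from Table 1.
MUL_US = {160: 1.36, 192: 1.76, 239: 2.68}
ADD_US = {160: 0.250, 192: 0.254, 239: 0.291}

# (multiplications, additions) for m repeated doublings, copied from Table 2.
COUNTS = {
    "Algorithm 5": lambda m: (10 * m, 13 * m),
    "Algorithm 6": lambda m: (8 * m + 2, 14 * m - 1),
    "Algorithm 7": lambda m: (8 * m + 2, 8 * m + 2),
}


def estimate_us(name, m, bits):
    """Estimated time of m doublings, counting only field operations."""
    muls, adds = COUNTS[name](m)
    return muls * MUL_US[bits] + adds * ADD_US[bits]


for name in COUNTS:
    print(name, round(estimate_us(name, 4, 160), 2))
```

Under this model, Algorithm 6 already saves time through fewer multiplications, and Algorithm 7 saves more because an addition costs roughly a fifth of a multiplication at 160 bits.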



4 Implementation 

4.1 DSP and Development Tools 

For the implementation, we used the DSP TMS320C6201 [18] (Texas
Instruments). The DSP consists of eight parallel-operation functional units including
two 16-bit multiplication units, and has a performance of 1600 MIPS at 200
MHz. The instruction processing system is of the VLIW/pipeline type and can
execute conditional operations. The maximum instruction code size is 64
Kbytes.

As the development tools, an assembler and C compiler are provided. We
implemented arithmetic routines such as modular multiplication, addition, and
subtraction in assembly language. Their performance greatly affects the total
performance, because they are performed frequently. Other routines were written
in C for easy implementation.

4.2 Implementation of RSA and DSA

We used the following methods: 

— Modular multiplication with the Montgomery multiplication method [14] 
described in section 2. 

— Modular exponentiation with m-ary method [7] for m = 4. 

4.3 Implementation of ECC 

We used the following methods:

— Modular multiplication with the Montgomery multiplication method [14] 
described in section 2. 

— Fast elliptic doubling with the method described in section 3, combined with 
the technique for increasing speed in case EC domain parameter a = 0. 

— Elliptic addition based on IEEE P1363 draft [17]. 

— Base point exponentiation with the fixed-base comb method [11],
using two 5-bit precomputed tables.

— Random point exponentiation with sliding-window exponentiation
[11], using a 4-bit precomputed table and the signed-binary representation [7]
of the exponent.

4.4 Code Size 

We implemented RSA, DSA and ECDSA based on the above methods, and the total
instruction code size was 41.1 Kbytes. Since the TMS320C6201 allows a maximum
instruction code size of 64 Kbytes, this implementation can deal with RSA, DSA
and ECDSA without reloading.

4.5 Performance of RSA, DSA, and ECC 

Table 3 shows the performance of the RSA and DSA implementation. Table 4
shows the performance of the ECC implementation including the exponentiation
on a random point. We measured the clock count averaged over 100 runs and
converted it to running time at 200 MHz.

In Table 3, we used e = 2^16 + 1 for the RSA verification key, and the Chinese
remainder theorem for RSA signing.

In Table 4, the exponent of a random point has the same length as that of the EC
domain parameter p. The ECDSA scheme is based on the IEEE P1363 draft.
Table 4 also shows the bit length of the order of the base point, which affects the
performance of ECDSA.



Table 3. Performance of RSA and DSA @ 200 MHz.

          RSA                        DSA
          1024-bit     2048-bit      512-bit      1024-bit
Sign      11.7 msec    84.6 msec     2.62 msec    7.44 msec
Verify    1.2 msec     4.5 msec      4.82 msec    14.5 msec

Table 4. Performance of ECC @ 200 MHz.

EC domain parameter p                        160-bit      192-bit      239-bit

a ≠ 0    Order of the base point             151-bit      192-bit      239-bit
         Exponentiation on a random point    3.09 msec    4.64 msec    8.47 msec
         ECDSA sign                          1.13 msec    1.67 msec    2.85 msec
         ECDSA verify                        3.97 msec    6.28 msec    11.2 msec

a = 0    Order of the base point             160-bit      185-bit      232-bit
         Exponentiation on a random point    2.88 msec    4.15 msec    7.60 msec
         ECDSA sign                          1.09 msec    1.50 msec    2.66 msec
         ECDSA verify                        3.78 msec    5.50 msec    9.78 msec



5 Conclusion 

We proposed fast implementation methods of Montgomery multiplication and 
m-repeated elliptic doublings, which are efficient for any EC domain parameters 
and suitable for the server systems. Our methods are efficient not only for DSP, 
but also for any other processors. 

Construction of our Montgomery multiplication method is suitable for the 
implementation on various pipeline processors. Furthermore, our method is also 
effective for the implementation on non-pipeline processors, because it computes 
all carries within a single-precision value. 

Our m-repeated elliptic doublings method eliminates 2m - 2 multiplications
and 5m - 2 additions compared with the m times elliptic doublings specified in the IEEE
P1363 draft. This method is efficient on any processor. The faster multiplication is
relative to addition, the more effective our method becomes.

We implemented RSA, DSA and ECC with our method on the latest DSP
TMS320C6201 (Texas Instruments). The performance is 11.7 msec for 1024-bit
RSA signing, 14.5 msec for 1024-bit DSA verification and 3.97 msec for 160-bit
ECDSA verification.

Abstract. The smart card has been suggested for personal security in
public key cryptosystems. As the size of the keys in public key
cryptosystems increases, the design of crypto-controllers in smart cards
becomes more complicated. This paper proposes a secure device in a
terminal, “Secure Module”, which can support the precomputation technique
for Schnorr-type cryptosystems such as Schnorr [7], DSA [1] and KCDSA
[5]. This gives a simple method to implement secure public key
cryptosystems without the technical effort of redesigning a cryptographic controller
in 25 mm^2 smart card ICs.

Keywords : Secure module in terminal, smart cards, precomputation, 
digital signature/identification, public-key cryptosystem 

1 Introduction 

Public key cryptography proposed by Diffie and Hellman in 1976 allowed users
to communicate securely without sharing secret information beforehand. These 
techniques can provide secure communication in today’s open systems with pri- 
vacy and message authentication. With public key techniques, a party has pairs 
of keys: a private key that is known only to the party and a public key available 
to all other users. A certificate for a public key is issued to a user so that any 
other party can verify the owner of the public key. 

The user’s personal security depends on the management of the private key.
Not only must the private key be stored in a secure memory, but the
computation using the private key must also be performed in a secure device. Smart cards,
chip-embedded plastic cards with an MCU, make this possible with their
tamper-resistant silicon chips. To store user-specific data securely against an adversary, the
EEPROM (Electrically Erasable Programmable Read Only Memory) in the chip
is used.

The crypto-controller, which uses the MCU and an additional arithmetic
coprocessor for modular arithmetic, provides the secure computation using the private
key.

However, it is difficult to implement a secure and efficient public key
cryptosystem due to the chip size allowed in the smart card, which is specified by
International Organization for Standardization (ISO) standard 7816. The
parameters such as the modulus and base integer in Schnorr-type protocols and the
private key in RSA-type protocols are at least 128 bytes long, and may be more
than 300 bytes, depending on the algorithm and security required [8]. Therefore,
the smart card needs an additional arithmetic coprocessor optimized for modular
exponentiation of long operands working on an 8-bit MCU [6]. Since the
mechanical stability requirements of ISO standard 7816-1 limit the maximum chip size
of the smart card ICs to 25 mm^2, the additional coprocessor for cryptographic
algorithms must take up the space typically occupied by nonvolatile memory,
specifically EEPROM.

Since the modular exponentiation of long operands is a very time-consuming
task, Schnorr proposed precomputing the exponentiations of secret
random numbers during idle time [7]. This idea can be applied to every scheme based
on the discrete logarithm problem, such as DSA [1] and KCDSA [5]. With this
precomputation, a signature can be generated with only one modular
multiplication. However, since ISO standard 7816 specifies that the smart card
is not supplied with electric power during its idle time, the precomputation
technique cannot be applied to the smart card. In this paper, we propose the
concept of a “Secure Module (SM)” to design an efficient and secure public key
cryptosystem without an additional crypto-coprocessor. A normal smart card can
be used as an SM when located in a terminal. With this SM and the precomputation
technique, we can efficiently generate Schnorr-type signatures.

Section 2 explains the environment of the proposed system and section 3 gives 
an example on how to implement the proposed system with Schnorr scheme. 
Then we consider the security of our proposed system in section 4. 

2 Environment of the Proposed System 

A public key cryptosystem needs a secure place to store the user’s private key. 
The nonvolatile memory, EEPROM, of the smart card is an appropriate space 
for the lengthy private key. To compute the exponentiation of a secret random 
number in signature/identification, we propose to outfit the terminal with a 
tamper-resistant device. We will call this device a Secure Module (SM). 

The hardware specification is depicted in Fig. 1 and the requirements of the 
smart card (SC) and the terminal with secure module are as follows: 

Smart Card (SC) 

A normal IC card with 8-bit MCU, ROM, RAM, EEPROM, and COS (Card 
Operating System) according to ISO standard 7816 satisfies the hardware spe- 
cification of the SC. The user data and cryptographic functions to be stored in 
EEPROM are as follows: 

encryption algorithm, E( ) : symmetric key ciphering algorithm used in
sending the encrypted private key to the SM.

random number generator, r_1( ) : algorithm to generate a secure random
number used in mutual authentication with the SM.

identifier of the SC, ID_SC : identifier of the smart card.

secret key, K_SC : symmetric key for E( ) used in mutual authentication.
This is issued when the card is issued for a user and depends on ID_SC
and the master key K_M stored only in the SM.

private key, X : user’s private key for the public-key cryptosystem.

certificate for public key, C : certificate for the user’s public key
corresponding to the private key X.

The certificate C is optional, because, in some systems, other users can obtain
C from the trusted third party (TTP).




Fig. 1. The hardware specification of terminal with SM 



Hardware Specification of Terminal

interface : LCD, keypad, host interface (RS232, modem, floppy, etc), and 
card interface (ISO 7816) 

microprocessor 
memory : RAM, ROM 
secure module 



Secure Module (SM) 

A normal IC card with 8-bit MCU, ROM, RAM, EEPROM, and COS satisfies 
the hardware specification of SM. The user data and cryptographic functions to 
be stored in EEPROM are as follows: 

encryption algorithm, E( ) : symmetric key ciphering algorithm.

random number generator, r_2( ) : algorithm generating a secure random
number used in mutual authentication with the SC and in
signature/identification schemes.

modular arithmetic functions : executable programs for modular
multiplication, modular addition, and modular reduction.

precomputation functions : executable programs to precompute
modular exponentiations for secret random numbers.

master key, K_M : symmetric key for E( ) used in mutual authentication.
This is issued securely when the terminal is initialized.

3 Implementations of a Cryptographic Protocol

Every signature/identification scheme based on the discrete log problem computes
W = g^k (mod p) for some random number k ∈ {1, 2, ..., q - 1}, where p is a large
prime and g is an element in {2, ..., p - 1} with order q, a large prime divisor of
p - 1. Since k does not depend on the message to be signed or the identifier to be
authenticated, these values can be precomputed in idle time, i.e. when the card
is not in use. However, since the smart card is not supplied with power when
it is not in use, a precomputation program stored in the smart
card cannot run during idle time. The exponentiation can therefore be
preprocessed only if the program for this algorithm is stored in the SM located in the
terminal.

An explanation of the Schnorr scheme is given below [7]:

Schnorr Signature Scheme for Message M:

(a) Choose a random number k in {1, ..., q - 1} and compute W = g^k (mod p).

(b) Compute the first signature R, where R = h(W||M) and h( ) is a
collision-resistant hash function.

(c) Compute the second signature S, where S = k + X · R (mod q).
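Steps (a) to (c) can be sketched as follows. Everything concrete here is an assumption for illustration: the toy group p = 2039, q = 1019, g = 4 is far too small for real use, SHA-256 reduced modulo q stands in for h( ), and the verification rule (recover W as g^S · Y^{-R} with public key Y = g^X) is the one that matches S = k + X · R; the text does not spell out verification.

```python
import hashlib
import secrets

P, Q, G = 2039, 1019, 4                    # toy group: Q divides P - 1, G has order Q


def h(w, message):
    """Stand-in hash h(W || M), reduced modulo Q (an assumption of this sketch)."""
    data = w.to_bytes(2, "big") + message
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % Q


def precompute():
    """Step (a): independent of the message, so it can run ahead of time."""
    k = secrets.randbelow(Q - 1) + 1       # k in {1, ..., q-1}
    return k, pow(G, k, P)


def sign(x, message, pair):
    """Steps (b) and (c): one hash, one modular multiplication, one addition."""
    k, w = pair
    r = h(w, message)
    s = (k + x * r) % Q
    return r, s


def verify(y, message, signature):
    """Recover W = g^s * y^(-r) mod p and compare hashes (y = g^x)."""
    r, s = signature
    w = pow(G, s, P) * pow(y, -r, P) % P
    return h(w, message) == r


x = 123                                     # private key X (toy value)
y = pow(G, x, P)                            # matching public key
signature = sign(x, b"hello", precompute())
print(verify(y, b"hello", signature))
```

The expensive exponentiation sits entirely in `precompute`, which is why moving it to idle time leaves only cheap operations on the signing path.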

Since (a) does not depend on the message, this step permits precomputation
of the quantities needed to sign the next messages. We propose to precompute
(k, W) values in the SM during idle time of the terminal. Since the SM is
tamper-resistant, the random k’s can be stored securely and the private key X saved in
the SC cannot be exposed. When the message is signed, with W precomputed, only
one modular multiplication and one modular addition are needed. Since (c) uses
the user’s private key, the SM must obtain the private key securely from the SC.
A protocol of the Schnorr scheme using the proposed system is described below.
This protocol is depicted in Fig. 2.

Protocol to Compute a Signature Using Schnorr Scheme:

Step 0. Precompute values of (k, W) in SM.

If there is any empty place in storage, pick a random number k in {1, ..., q - 1},
compute W = g^k mod p and store the pair (k, W). Otherwise, wait for
an interrupt signal that is sent by the terminal when a message arrives. If an
interrupt signal arrives during the precomputation, the SM quits the
precomputation and discards the intermediate result.

Step 1. Establish mutual authentication and share a session key between the SC
and the SM.

The SC and the terminal must verify whether the other party is legitimate by
carrying out mutual authentication algorithms. When the SC sends ID_SC to
the SM, the SM generates K_SC from ID_SC and K_M by a specified
method. With this shared symmetric key K_SC and the encryption algorithm
E( ), the SC and the SM mutually authenticate and share a session key K_S
using random numbers as in [2,3].

Step 2. Transfer the encrypted private key from the SC to the SM.

The SC encrypts the private key X with the session key K_S, denoted E_{K_S}(X),
and sends the ciphertext to the SM. The SM decrypts the ciphertext E_{K_S}(X)
with K_S and gets the private key X.

Step 3. Compute signature values in SM.

The SM computes the signature values R and S using the secret random number k,
the precomputed value W and the message M as follows:

Step 3-1. Retrieve a precomputed pair (k, W).

Step 3-2. Compute R = h(W||M).

Step 3-3. Compute S = k + X · R (mod q).

Step 3-4. Output (R, S) and erase the used pair (k, W).
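The interplay of Step 0 and Step 3 can be sketched with a small pool of precomputed pairs. This is a simplified model under stated assumptions: Steps 1 and 2 (mutual authentication and encrypted key transfer) are left out and the private key is passed in directly, the class name `SecureModule`, the pool capacity and the toy parameters p = 2039, q = 1019, g = 4 are invented for illustration, and SHA-256 modulo q stands in for h( ).

```python
import hashlib
import secrets
from collections import deque

P, Q, G = 2039, 1019, 4                    # toy group, far too small for real use


def hash_to_q(w, message):
    """Stand-in for h(W || M), reduced modulo Q."""
    digest = hashlib.sha256(w.to_bytes(2, "big") + message).digest()
    return int.from_bytes(digest, "big") % Q


class SecureModule:
    """Keeps a bounded pool of (k, W) pairs filled while the terminal is idle."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.pool = deque()

    def idle_tick(self):
        """Step 0: fill one empty slot, if any; return whether work was done."""
        if len(self.pool) >= self.capacity:
            return False
        k = secrets.randbelow(Q - 1) + 1
        self.pool.append((k, pow(G, k, P)))
        return True

    def sign(self, x, message):
        """Step 3: consume one pair; raises IndexError if the pool is empty."""
        k, w = self.pool.popleft()         # Step 3-1, and the pair is erased (3-4)
        r = hash_to_q(w, message)          # Step 3-2
        s = (k + x * r) % Q                # Step 3-3
        return r, s


def verify(y, message, signature):
    """Check with public key y = g^x by recovering W = g^s * y^(-r) mod p."""
    r, s = signature
    return hash_to_q(pow(G, s, P) * pow(y, -r, P) % P, message) == r


sm = SecureModule(capacity=2)
while sm.idle_tick():                      # idle time: fill the pool
    pass
x = 321                                    # private key X, as received in Step 2
signature = sm.sign(x, b"order #1")
print(verify(pow(G, x, P), b"order #1", signature), len(sm.pool))
```

Popping the pair before use ensures a given k is never reused for two messages, which would expose X.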




Fig. 2. The protocol to compute Schnorr-signature using SM 



When the message is long, the hashing in Step 3-2 may be performed in the
host as in Fig. 2, because the channel between the terminal and the host
generally has limited bandwidth.

We implemented this protocol with the digital signature algorithm KCDSA
[5], using the SM in a terminal with an 80C320 MCU and the SC with a Hitachi
H8/3102 (8 Kbyte EEPROM, 16 Kbyte ROM, 512 byte RAM). The cryptographic
programs (modular operations, precomputation functions and random number
generator) were coded in 8051 assembler and their executable programs
were stored in the EEPROM of the SM. We adapted the algorithm of [4] for
exponentiation; it takes 30 seconds for one exponentiation, which
is rather slow. However, thanks to the precomputation of this time-consuming
job, the elapsed time to compute a digital signature is one second. Therefore,
we could efficiently obtain a Schnorr-type digital signature without using highly
complex techniques for ICs.

4 Security Consideration

In our proposed system, each terminal has its own SM and all SMs have the
same master key K_M in common. The master key is issued when the terminal
is initialized. Let us assume that an adversary knows the master key K_M. Since
anyone can easily get the identifier ID_SC of any smart card, the adversary can
compute the secret key K_SC shared between the smart card and the SM. In
addition, if he succeeds in tapping or monitoring the bit stream transferred
between the smart card and the terminal during mutual authentication and session
key (K_S) establishment, he can get the session key K_S. Eventually he can get
the user’s private key X from the encrypted private key E_{K_S}(X). Unfortunately,
this allows an adversary who knows the master key K_M to get the
private key X stored in any smart card, even though the private key is stored
in the tamper-resistant region of the smart card. Consequently, preventing the
master key from being revealed is very important in our proposed system.

However, if there is no problem in initializing a terminal, i.e., the master key
is stored in the tamper-resistant region of the terminal without being revealed,
and if the computation of K_SC from the smart card identifier ID_SC and master key
K_M is performed in the tamper-resistant region of the SM, then we believe there will be
no serious security hole in our system.

Even if an adversary knows the smart card identifier ID_SC but not the
master key K_M, attacks using differential analysis, timing attacks or fault
attacks are unlikely to work, because the adversary knows only the values ID_SC,
E_{K_S}(X), W = g^k mod p, R = h(W||M) and S = k + X · R mod q.

5 Conclusion 

In this paper, we proposed to use a secure device in a terminal, “Secure Module” , 
which can support precomputation technique for Schnorr-type cryptosystem. 
This gives an easy method to implement personal security without technical 
efforts to design cryptographic controller in 25mm? smart card ICs. We could 
get KCDSA digital signature in one second using the proposed system. 

Montgomery’s Multiplication Technique: How to Make It Smaller and Faster 

Abstract. Montgomery’s modular multiplication algorithm has enabled 
considerable progress to be made in the speeding up of RSA cryptosy- 
stems. Perhaps the systolic array implementation stands out most in the 
history of its success. This article gives a brief history of its implemen- 
tation in hardware, taking a broad view of the many aspects which need 
to be considered in chip design. Among these are trade-offs between area 
and time, higher radix methods, communications both within the cir- 
cuitry and with the rest of the world, and, as the technology shrinks, 
testing, fault tolerance, checker functions and error correction. We con- 
clude that a linear, pipelined implementation of the algorithm may be 
part of best policy in thwarting differential power attacks against RSA. 

Key Words: Computer arithmetic, cryptography, RSA, Montgomery 
modular multiplication, higher radix methods, systolic arrays, testing, 
error correction, fault tolerance, checker function, differential power ana- 
lysis, DPA. 



1 Introduction 

An interesting fact is that the faster the hardware the more secure the RSA 
cryptosystem becomes. The effort of cracking the RSA code via factorization of 
the modulus $M$ doubles for every 15 or so bits at key lengths of around $2^{10}$ bits 
[10]. However, adding the 15 bits only increases the work involved in decryption 
by $((1024+15)/1024)^2$ per multiplication and so by $((1024+15)/1024)^3$ per 
exponentiation, i.e. 5% extra! Thus speeding up the hardware by just 5% enables 
the cryptosystem to become about twice as strong without needing any other 
extra resources. Speed, therefore, seems to be everything. Indeed it is essential 
not just for cryptographic strength but also to enable real time decryption of the 
large quantities of data typically required in, for example, the use of compressed 
video. 
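
The 5% figure can be checked with a few lines of arithmetic. The sketch below assumes, as the text does, that a modular multiplication costs time quadratic in the key length and an exponentiation cubic; the variable names are illustrative only.

```python
# Relative cost of lengthening a 1024-bit modulus by 15 bits, assuming
# quadratic cost per multiplication and cubic cost per exponentiation.
bits, extra = 1024, 15
ratio = (bits + extra) / bits

per_multiplication = ratio ** 2
per_exponentiation = ratio ** 3

print(f"per multiplication: +{(per_multiplication - 1) * 100:.1f}%")
print(f"per exponentiation: +{(per_exponentiation - 1) * 100:.1f}%")
```

The exponentiation overhead comes out a little under 5%, which the text rounds up.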

On the other side of the Atlantic, the first electronic computer is generally 
recognised to be the Colossus, designed by Tommy Flowers, and built in 1943-4 
at Bletchley Park, England. It was a dedicated processor to crack the Enigma 
code rather than a general purpose machine like the ENIAC, constructed slightly 
later by John Eckert and John Mauchly in Philadelphia. With the former view of 
history, cryptography has a fair claim to have started the (electronic) computer 
age. Breaking Enigma depended on a number of factors, particularly human 
weakness in adhering to strict protocols, but also on inherent implementation 
weaknesses. 

Timing analysis and differential power analysis techniques [12] show that 
RSA cryptosystems suffer mainly not from lack of algorithmic strength but 
from implementation weaknesses. No doubt governments worked on such techni- 
ques for many years before they appeared in the public domain and have deve- 
loped sufficiently powerful techniques to crack any system through side channel 
leakage. Is it a coincidence that the US became less paranoid about the use of 
large keys when such techniques were first published? Now that there seems to be 
no significant gain to be made from further improvement of algorithms, the top 
priority must be to prevent such attacks by reducing or eliminating variations 
in timing, power and radiation from the hardware. 

This survey provides a description of the main ideas in the hardware imple- 
mentation of the RSA encryption process with an emphasis on the importance 
of Montgomery’s modular multiplication algorithm [17]. It indicates the main 
publications where the significant contributions are to be found, but does not 
attempt to be exhaustive. The paper discusses the major problems associated 
with space- and time-efficient implementation and reviews their solution. Among 
the issues of concern are carry propagation, digit distribution, buffering, com- 
munication and use of available area. Finally, there are a few remarks on the 
reliability and cryptographic strength of such implementations. 

2 Notation 

An RSA cryptosystem [20] consists of a modulus $M$, usually of around 1024 bits, 
and two keys $d$ and $e$ with the property that $A^{de} \equiv A \bmod M$. Message blocks $A$ 
satisfying $0 \le A < M$ are encrypted to $C = A^e \bmod M$ and decrypted uniquely 
by $A = C^d \bmod M$ using the same algorithm for both processes. $M = PQ$ is 
a product of two large primes and $e$ is often chosen small with few non-zero 
bits (e.g. a Fermat prime, such as 3 or 17) so that encryption is relatively fast. 
$d$ is picked to satisfy $de \equiv 1 \bmod \phi(M)$ where $\phi$ is Euler’s totient function, 
which counts the number of residue classes prime to its argument. Here $\phi(M) = 
(P-1)(Q-1)$ so that $d$ usually has length comparable to $M$. The owner of the 
cryptosystem publishes $M$ and $e$ but keeps secret the factorization of $M$ and 
the key $d$. Breaking the system is equivalent to discovering $P$ and $Q$, which is 
computationally infeasible for the size of primes used. 
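
As a small illustration of this notation, the following sketch builds a toy RSA key and checks that decryption inverts encryption. The primes, exponent and message are assumptions chosen for readability and are far too small for real use; `pow(e, -1, phi)` needs Python 3.8 or later.

```python
from math import gcd

# Toy parameters (illustrative assumptions, not from the paper).
P, Q = 61, 53
M = P * Q                    # public modulus
phi = (P - 1) * (Q - 1)      # Euler's totient of M
e = 17                       # small public exponent, a Fermat prime
assert gcd(e, phi) == 1
d = pow(e, -1, phi)          # private exponent with d*e = 1 mod phi(M)

A = 65                       # message block, 0 <= A < M
C = pow(A, e, M)             # encryption: C = A^e mod M
recovered = pow(C, d, M)     # decryption: A = C^d mod M
print(d, C, recovered)
```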

The computation of $A^e \bmod M$ is characterised by two main processes: modular 
multiplication and exponentiation. Here we really only consider computing 
$(A \times B) \bmod M$. Exponentiation is covered in detail elsewhere, e.g. [9], [28]. To 
avoid potentially expensive full length comparisons with $M$, it is convenient to 
be able to work with numbers $A$ and $B$ which may be larger than the modulus. 
Assume numbers are represented with base (or radix) $r$ which is a power of 2, 
say $r = 2^k$, and let $n$ be the maximum number of digits needed for any number 
encountered. Here $r$ is determined by the multiplier used in the implementation: 
an $r \times r$ multiplier will be used to multiply two digits together. The hardware 
also determines n if we are considering dedicated co-processors with a maximum 
register size of n digits. Generally, M must have fewer bits than the largest 
representable number; how much smaller will be determined by the algorithm 
used. 

Except for the exponent $e$, each number $X$ will have a representation of the 
form $X = \sum_{i=0}^{n-1} x_i r^i$. Here the $i$th digit $x_i$ often satisfies $0 \le x_i < r$, yielding a 
non-redundant representation, i.e. one for which each representation is unique. 
The modulus $M$ will always have such a representation. However, in order to 
limit the scope of interactions between neighbouring digits, a wider range of 
digits is very useful. Typically this is given by an extra (carry) bit so that 
digits lie in the range $0 .. 2r-1$. For example, the output from a carry-save adder 
(with $r = 2$) provides two bits for each digit and so, in effect, is a redundant 
representation where digits lie in the range $0 .. 3$ rather than the usual $0 .. 1$. Here 
the $k$-bit architecture means that our adder will probably propagate carries up 
the length of a single digit, providing a “save” part in the range $0 .. r-1$ and a 
“carry” part representing a small multiple of $r$. A digit $x$ split in this way is 
written $x = x_s + r x_c$. In fact, the addition cycle in our algorithms involves digit 
products, so that a double length result is obtained. Hence, with some notable 
exceptions, the carry component regularly consists of another digit and a further 
one or two more bits. In a calculation $X \leftarrow A + Y$, the digit slices can operate 
in parallel, with the $j$th digit slice computing 

$x_{j,s} + r x_{j,c} \leftarrow a_{j,s} + a_{j-1,c} + y_j$ 

The extra range of $x_j$ given through the carry component keeps $x_j$ from having 
to generate a separate carry which would need propagating. Since only old values 
appear on the right, not new ones, carry propagation does not force sequential 
digit processing. The digit calculations can therefore be performed in parallel 
when there is sufficient redundancy. 
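
The digit-slice rule can be sketched as follows. This is a minimal model, assuming $r = 16$, least-significant-digit-first lists, and a hypothetical helper `value` that recombines save and carry parts; it is an illustration, not hardware from the paper.

```python
def value(save, carry, r):
    """Integer represented by the save digits plus carries of weight r."""
    return sum((s + r * c) * r**j for j, (s, c) in enumerate(zip(save, carry)))

def redundant_add(a_save, a_carry, y, r):
    """One step X <- A + Y in which each digit slice reads only old values."""
    n = len(a_save)
    assert a_carry[n - 1] == 0, "top slice must have no pending carry"
    x_save, x_carry = [0] * n, [0] * n
    for j in range(n):  # slices are independent, so hardware can run them in parallel
        carry_in = a_carry[j - 1] if j > 0 else 0
        t = a_save[j] + carry_in + y[j]
        x_save[j], x_carry[j] = t % r, t // r
    return x_save, x_carry

r, n = 16, 6
y = [15, 15, 15, 15, 0, 0]            # 0xFFFF, least significant digit first
save, carry = [0] * n, [0] * n
for _ in range(3):                    # accumulate Y three times
    save, carry = redundant_add(save, carry, y, r)
total = value(save, carry, r)
```

No slice waits for a carry produced in the same step, yet the represented value is still the exact sum.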

3 Digit Multipliers and Their Complexity 

Early modular multiplication designs treated radix 2, 4 or even 8 separately at a 
gate level. With rapidly advancing technology, these have had to be replaced by 
the generic radix r viewpoint which is now essential for a better understanding 
of the general principles as well as for a modular approach to design and for 
selecting from parametrised designs to make best use of available chip area. To- 
day’s embedded cryptosystems are already using off-the-shelf 32-bit multipliers 
[21] where reduction of the critical path length by one gate makes virtually no 
difference to the speed, and would probably cost too much in terms of additional 
silicon area. These $r \times r$ multipliers form the core of an RSA processor, 
forming the digit-by-digit products. 

In the absence of radical new algorithms we need to be aware of complexity 
results for multiplication but prepared to use pre-built state-of-the-art multi- 
pliers which contain years of experience in optimisation and which come with 
a guarantee of correctness. Practical planar designs are known for multipliers 
which are optimal with respect to some measure of time and area [3], [4], [15], 
[16], [19], [23]. A reasonable assumption which held until recently is that wires 
take area but do not contribute noticeably to time. Under such a model, 
Area$\times$Time$^2$ complexity for a $k$-bit multiplication is bounded below by $k^2$ [3] 
and this bound can be achieved for any time in the range $\log k$ to $\sqrt{k}$ [16]. Such 
designs tend to use the Discrete Fourier Transform and consequently involve 
large constants in their measures of time and area. There are more useful de- 
signs which are asymptotically poorer but perform better if k is not too large. 
The cross-over point is greater than the size of the digits here [15]. So classi- 
cal multiplication methods are preferable. Indeed, for a chip area of around 10^ 
transistors devoted entirely to RSA and containing hardware for a full length 
digit multiplication $a_i \times B$, $k$ = 32 or 64 is the maximum practical since there 
must be space for registers and for other operations such as the modular reduc- 
tion. In the latest technology wires have significant capacitance and resistance 
and there is a requirement from applications as diverse as cell phones and deep 
space exploration for low power circuitry. This requires a different model which 
is more sensitive to wire length and for which results are only just emerging [18]. 

Speed is most easily obtained by using at least $n$ multipliers to perform a 
full length multiplication $a_i \times B$ (or equivalent) in one clock cycle. If we were not 
worried about modular reduction, the carry propagation problem could be taken 
care of by pipelining this row of multipliers (Fig. 1): $a_i b_j$ is then computed by 
the $j$th multiplier during the $i+j$th clock cycle, generating a carry which is fed 
into the next multiplier, which computes $a_i b_{j+1}$ in the next cycle. 



Figure 1. A Pipeline of Multipliers for $R \leftarrow a_i \times B$. 

Is this set-up fast enough for real-time processing? A realistic measure of 
the speed required for real-time decryption is provided by an assumption that 
the internal bus speed is in the order of one $k$-bit digit per clock cycle. If the 
$k$-bit multiplier operates in one cycle with no internal pipelining then computing 
$A \times B$ takes $n$ cycles using $n$ multipliers in parallel in order to compute $a_i \times B$ in 
one cycle. The throughput is therefore one digit per cycle for a multiplication. 
Unfortunately, since RSA decryption requires $O(nk)$ multiplications, we may 
actually need a two dimensional array of multipliers rather than just a row of 
them to perform real-time decryption. 

Of course, there is an immediate trade-off between time and area. Doubling 
the number of digit multipliers in an RSA co-processor allows the parallel 
processing of twice as many digits and so halves the time taken. This does not 
contradict the Area$\times$Time$^2$ measure being constant for non-pipelined multipliers, 
although it appears to require less area than expected for the speed-up achieved. 
Having two rows of digit multipliers with one row feeding into the other creates 
a pipeline (now with respect to digits of A) which doubles the throughput that 
the complexity rule expects. This indicates that choosing the largest r possible 
for the given silicon area may not be the best policy; a pipelined multiplier or 
several rows of smaller multipliers may yield better throughput for a given area. 

Finally, despite a wish to use well-established multipliers, differential power 
analysis (DPA) attacks on cryptographic products [13] suggest that special pur- 
pose multipliers need to be designed for some RSA applications which contain 
the secret keys, such as smart cards. Briefly, switching a gate consumes more 
power than not doing so. Inputs for which Hamming weights are markedly more 
or less than average could therefore have a power consumption with measurable 
deviation from average and reveal useful information to an attacker. This is true 
of today’s optimised multipliers. 



4 Modular Reduction & the Classical Algorithm 

The reduction of $A \times B$ to $(A \times B) \bmod M$ can be carried out in several ways 
[11]. Normally it is done through interleaving the addition of $a_i B$ with modular 
reductions instead of computing the complete product first. This makes some 
savings in hardware. In particular, it enables the partial product to be kept 
inside an n-digit register without overflow into a second such register. Each 
modular reduction involves choosing a suitable digit q and subtracting qM from 
the current result. The successive choices of digit q can be pieced together to 
form the integer quotient $Q = \lfloor (A \times B)/M \rfloor$ or a closely related quantity: 

Classical Modular Multiplication Algorithm: 

{ Pre-condition: 0 <= A < r^n } 

R := 0 ; 

For i := n-1 downto 0 do 
Begin 

R := r x R + a_i x B ; 

q_i := R div M ; 

R := R - q_i x M ; 

End 

{ Post-condition: R = AxB - QxM 

and, consequently, R = (AxB) mod M } 
If we define $A_i = \sum_{j=i}^{n-1} a_j r^{j-i}$ so that $A_i = rA_{i+1} + a_i$ and use similar notation 
for $Q_i$ then it is easy to prove by induction that at the end of the $i$th iteration 
$R = A_i \times B - Q_i \times M$. Hence the post-condition holds. 
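
A minimal executable sketch of the classical algorithm follows, assuming Python integers stand in for the $n$-digit registers and an exact quotient digit is taken on each iteration (the approximation discussed next is not modelled). The function name and test values are illustrative.

```python
def classical_modmul(A, B, M, r, n):
    """Interleaved multiply-and-reduce, most significant digit of A first."""
    assert 0 <= A < r**n
    R, Q = 0, 0
    for i in range(n - 1, -1, -1):
        a_i = (A // r**i) % r          # i-th digit of A
        R = r * R + a_i * B            # shift up and add the partial product
        q_i = R // M                   # exact quotient digit
        R -= q_i * M                   # modular reduction
        Q = r * Q + q_i                # piece the quotient together
    return R, Q

A, B, M, r, n = 0xBEEF, 0xCAFE, 0xF123, 16, 4
R, Q = classical_modmul(A, B, M, r, n)
```

Both parts of the post-condition, $R = A \times B - Q \times M$ and $R = (A \times B) \bmod M$, can then be checked directly.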

Brickell [5] observes that $R$ div $M$ need only be approximated using the top 
bits of $R$ and $M$ in order to keep $R$ bounded above. In particular [24], suppose 
that $\underline{M}$ is the approximation to $M$ given by setting all but the $k+3$ most 
significant bits of $M$ to zero, and that $\overline{M}$ is obtained from $\underline{M}$ by incrementing its 
$k+3$rd bit by 1. Then $\underline{M} \le M < \overline{M} \le (1+2^{-2}r^{-1})\underline{M}$. Assume $\underline{R}$ is given similarly 
by setting the same less significant bits to zero and that the redundancy 
in the representation of $R$ is small enough for $R < \underline{R}+\underline{M}$ to hold. The 
approximation to $q_i$ which is used is defined by the integer quotient $\underline{q}_i = \lfloor \underline{R}/\overline{M} \rfloor$. 
Then 

$\underline{q}_i = \lfloor \underline{R}/\overline{M} \rfloor \le \lfloor R/M \rfloor = q_i$ 

$< 1 + \underline{R}/\underline{M} \le 1 + (1+2^{-2}r^{-1})\underline{R}/\overline{M} < 1 + (1+2^{-2}r^{-1})(\underline{q}_i + 1)$ 

so that $q_i - \underline{q}_i < 1 + 4r(1+4r)^{-1} + q_i(1+4r)^{-1}$ from which, at the end of the 
loop, $R < (1+q_i-\underline{q}_i)M < 2M + (1+4r)^{-1}q_iM$. Assume the (possibly redundant) 
digits $a_i$ are bounded above by $a$. We will establish inductively that 

$R < 3M + a(1+3r)^{-1}B$ 

at the end of every iteration of the loop. Using the bound on the initial value of 
$R$ in the loop yields $q_iM \le rR+aB < 3rM + a(1+r(1+3r)^{-1})B$ so that, from the 
above, the value for $R$ at the end of the loop satisfies $R < 2M + (1+4r)^{-1}(3rM + 
a(1+r(1+3r)^{-1})B) < 3M + a(1+3r)^{-1}B$, as desired. Naturally, more bits for a 
better approximation to $q_i$ will yield a lower bound on $R$, whilst fewer will yield 
a worse bound or even divergence. 

The output from such a multiplication can be fed back in as an argument to 
a further such multiplication without any further adjustment by a multiple of 
M providing the bound on R does not grow too much. Assuming, reasonably, 
that redundancy in $A$ is bounded by $a < 2r$ we obtain $R < 3M + \frac{2}{3}B$ from which 
i) if $B > 9M$ then $B$ is an upper bound for successive modular multiplications 
and indeed the bound decreases towards a limit of 9M, and ii) if 9M is an upper 
bound for the input it is also an upper bound for the output. Hence only a small 
number of extra subtractions at the end of the exponentiation yields M as a 
final upper bound. 

There are no communication or timing problems when there is just one mul- 
tiplier. So, for the rest of this article, we will assume that the hardware for the 
algorithm consists of an array of cells, each one for computing a digit of $R$, and 
so each containing two multipliers. The main difficulty is that qi needs to be 
computed before any progress can be made on the partial product R. Scaling 
$M$ to make its leading two digits $10_r$ makes the computation easier [25], [27], as 
does shifting B downwards to remove the dependency on B. However, qi still has 
to be broadcast simultaneously to every digit position (Fig. 2) and redundancy 
has to be employed in R so that digit operations can be performed in paral- 
lel [5], [27]. These severe drawbacks make Montgomery’s algorithm for modular 
multiplication appear more attractive. 
Figure 2. Classical Algorithm with Shift Up, Carries and Digit Broadcasting 

5 Montgomery’s Algorithm 

Peter Montgomery [17] has shown how to reverse the above algorithm so that 
carry propagation is away from the crucial bits, redundancy is avoided and si- 
multaneous broadcasting of digits is no longer required. This noticeably reduces 
silicon area and shortens the critical path. Although the complexities of the two 
algorithms seem to be identical in time and space, the constants for Montgo- 
mery’s version are better in practice. Montgomery uses the least significant digit 
of an accumulating product R to determine a multiple of M to add rather than 
subtract. He chooses multiplier digits in the opposite order, from least to most 
significant and shifts down instead of up on each iteration: 

Montgomery’s Modular Multiplication Algorithm: 

{ Pre-condition: 0 <= A < r^n } 

R := 0 ; 

For i := 0 to n-1 do 
Begin 

R := R + a_i x B ; 

q_i := (-r_0 x m_0^(-1)) mod r ; 

R := (R + q_i x M) div r ; 

{ Invariant: 0 <= R < M+B } 

End 

{ Post-condition: R x r^n = AxB + QxM 

and, consequently, R = (AxBxr^(-n)) mod M } 

Here $m_0^{-1}$ is a residue mod $r$ satisfying $m_0 \times m_0^{-1} \equiv 1 \bmod r$. Since $r$ is a power 
of 2 and $M$ is odd (because it is a product of two large primes) $r$ and $M$ are 
coprime, which is enough to guarantee the existence of $m_0^{-1}$. The digit $q_i$ is chosen 
so that the expression $R + q_i \times M$ is exactly divisible by $r$. Its lowest digit is clearly 
0. If we define $A_i = \sum_{j=0}^{i} a_j r^j$ and $Q_i$ analogously then $A_i = A_{i-1} + a_i r^i$ and 
$A_{n-1} = A$. The value of $R$ at the end of the iteration whose control variable has 
value $i$ is easily shown by induction to satisfy $r^{i+1}R = A_i \times B + Q_i \times M$ because 
the division is exact. Hence the post-condition holds. The digits of $A$ are required 
in ascending order. Thus, they can be converted on-line into a non-redundant 
form and so we may assume $a_i \le r-1$. This enables the loop invariant bounds 
to be established by induction. 
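
The algorithm and its invariant can be exercised with a short sketch. It assumes Python integers in place of registers, an odd modulus, and illustrative test values; `pow(x, -1, r)` (Python 3.8 or later) supplies $m_0^{-1}$.

```python
def montgomery_modmul(A, B, M, r, n):
    """Montgomery multiplication, least significant digit of A first."""
    assert 0 <= A < r**n and M % 2 == 1
    m0_inv = pow(M % r, -1, r)             # m_0^(-1) mod r, exists since M is odd
    R, Q = 0, 0
    for i in range(n):
        a_i = (A // r**i) % r              # i-th digit of A
        R += a_i * B
        q_i = (-(R % r) * m0_inv) % r      # makes R + q_i*M divisible by r
        R = (R + q_i * M) // r             # exact division: shift down one digit
        Q += q_i * r**i
        assert 0 <= R < M + B              # loop invariant
    return R, Q

A, B, M, r, n = 0xBEEF, 0xCAFE, 0xF123, 16, 4
R, Q = montgomery_modmul(A, B, M, r, n)
```

The returned pair satisfies $R r^n = A \times B + Q \times M$, so $R$ is congruent to $A \times B \times r^{-n}$ modulo $M$, though not necessarily fully reduced.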

The extra power of $r$ factor in the output $R$ is easily cleared up by minor 
pre- and post-processing [6]. The easiest way to explain this is to associate with 
every number $A$ its Montgomery class mod $M$, namely 

$\bar{A} = A r^n \bmod M$ 

and to use $\otimes$ to denote the Montgomery modular multiplication. The Montgomery 
product of $\bar{A}$ and $\bar{B}$ is $\bar{A} \otimes \bar{B} = \bar{A}\bar{B} r^{-n} = AB r^n = \overline{AB} \bmod M$. So 
applying Montgomery multiplication to $\bar{A}$ in an exponentiation algorithm is going 
to produce $\overline{A^e}$ rather than $(\bar{A})^e$. Introduction of the initial power of $r$ to obtain 
$\bar{A}$ is performed using the precomputed value 

$R_2 = \overline{r^n} = r^{2n} \bmod M$ 

and a Montgomery multiplication thus [7]: $A \otimes R_2 = A r^n = \bar{A} \bmod M$. Removal 
of the final extra power of $r$ is also performed by a Montgomery multiplication: 
$\overline{A^e} \otimes 1 = A^e \bmod M$. 
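
The pre- and post-processing can be sketched as a complete exponentiation. The code assumes a left-to-right square-and-multiply loop and, for simplicity, a conditional subtraction after every Montgomery product; both are choices of this sketch, as are the function names.

```python
def mont_mul(X, Y, M, r, n):
    """Montgomery product X*Y*r^(-n) mod M, fully reduced for simplicity."""
    m0_inv = pow(M % r, -1, r)
    R = 0
    for i in range(n):
        R += ((X // r**i) % r) * Y
        q = (-(R % r) * m0_inv) % r
        R = (R + q * M) // r
    return R - M if R >= M else R          # R < M + Y < 2M when Y < M

def mont_pow(A, e, M, r, n):
    """A^e mod M computed entirely with Montgomery products."""
    R2 = pow(r, 2 * n, M)                  # precomputed r^(2n) mod M
    A_bar = mont_mul(A, R2, M, r, n)       # Montgomery class of A
    acc = mont_mul(1, R2, M, r, n)         # Montgomery class of 1
    for bit in bin(e)[2:]:                 # exponent bits, most significant first
        acc = mont_mul(acc, acc, M, r, n)
        if bit == "1":
            acc = mont_mul(acc, A_bar, M, r, n)
    return mont_mul(acc, 1, M, r, n)       # strip the extra power of r

result = mont_pow(65, 17, 3233, 16, 3)     # toy modulus 61*53 < 16^3
```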

Throughout the exponentiation, an output from one multiplication is used 
as an input to a subsequent multiplication. Without care the outputs will slowly 
increase in size. However, suppose $a_{n-1} = 0$. Then the bound $R < M+B$ at 
the end of the second last loop iteration is reduced to $M + r^{-1}B$ on the final 
round, which prevents unbounded growth when outputs are used as inputs. In 
particular, if the second argument satisfies $B < 2M$ then the output also satisfies 
$R < 2M$. Thus, suppose $2rM < r^n$. It is reasonable to assume that $A$ and $R_2$ 
are less than $2M$, even less than $M$, so that their topmost digits are both zero. 
Then the scaling of $A$ to $\bar{A}$ by Montgomery multiplication yields $\bar{A} < 2M$ and 
this bound is maintained as far as the final output $\overline{A^e}$. So only a single extra 
subtraction of $M$ may be necessary at the very end to obtain a least non-negative 
residue. 

However, when all the I/O is bounded by $2M$, an interesting and useful 
observation about the output $R$ of the final multiplication $\overline{A^e} \otimes 1 = A^e \bmod M$ 
can be derived from the post-condition of the modular multiplication, namely 
$R r^n = \overline{A^e} + QM$. $Q$ has a maximum value of $r^n - 1$. Hence, $\overline{A^e} < 2M$ would 
lead to the output satisfying $R r^n < (r^n + 1)M$ and so to $R \le M$, whilst a 
sub-maximal value for $Q$ immediately yields $R < M$ in the same way. Hence 
a final subtraction is only necessary when $R = M$, i.e. when $A^e \equiv 0 \bmod M$, 
that is, for $A \equiv 0 \bmod M$. It is entirely reasonable to assume that this never 
occurs in the use of the RSA cryptosystem as it would require starting with 
$A = M$, whereas invariably the initial value should be less than $M$. Moreover, 
each modular multiplication in the exponentiation would also have to return $M$ 
rather than 0 to prevent all subsequent operands becoming 0. Hence the final 
subtraction need never occur after the final re-scaling of an exponentiation. 
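
The claim that no subtraction is needed anywhere can be tested with a sketch. It assumes $n$ is chosen as the smallest value with $2rM < r^n$, so that every operand below $2M$ has a zero top digit; the function names and the left-to-right exponentiation are assumptions of this sketch.

```python
def mont_mul_nosub(X, Y, M, r, n):
    """Montgomery product X*Y*r^(-n) mod M with no final subtraction."""
    m0_inv = pow(M % r, -1, r)
    R = 0
    for i in range(n):
        R += ((X // r**i) % r) * Y
        q = (-(R % r) * m0_inv) % r
        R = (R + q * M) // r
    return R

def mont_pow_nosub(A, e, M, r):
    """A^e mod M for 0 < A < M using only subtraction-free products."""
    n = 1
    while r**n <= 2 * r * M:               # smallest n with 2rM < r^n
        n += 1
    R2 = pow(r, 2 * n, M)
    A_bar = mont_mul_nosub(A, R2, M, r, n)
    acc = mont_mul_nosub(1, R2, M, r, n)
    for bit in bin(e)[2:]:
        acc = mont_mul_nosub(acc, acc, M, r, n)
        if bit == "1":
            acc = mont_mul_nosub(acc, A_bar, M, r, n)
        assert acc < 2 * M                 # the bound is maintained throughout
    return mont_mul_nosub(acc, 1, M, r, n) # final re-scaling by 1

M, r = 3233, 16
results = [mont_pow_nosub(A, 17, M, r) for A in range(1, 60)]
```

Every intermediate value stays below $2M$, and the output of the final multiplication by 1 is already the least non-negative residue for these inputs.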
Computation of $q_i$ is a potential bottleneck here. Simplification may be achieved 
in the same two ways that applied to the classical algorithm above: this time 
scale $M$ to make $m_0 = -1$ and shift $B$ up so that $b_0 = 0$. These would make 
$q_i = r_0$, avoiding any computation at all. We consider the shift first, supposing 
initially that $B < 2M$. If the shift $B \leftarrow rB$ is added to the start of the 
multiplication code, then the loop invariant becomes $R < M + rB$. Hence we require the 
top two digits of $A$ to be zero in order for the bound on $R$ to be reduced first 
to $M+B$ and then to $M + r^{-1}B$, so that $R < 2M$ is output. If we always have 
$A < 2M$ then this is achieved if $2r^2M$ fits into the hardware registers of $n$ digits. 
The cost of this shift is hidden by the definition of $n$. The shift can be hardwired 
and counted as free, as can a balancing adjustment of the output through a higher 
power of $r$ in $R_2$. However, there is an extra iteration of the loop: before we 
had one more iteration than digits in $M$, now we have two more. Fortunately, the 
cost of one more iteration per multiplication is low compared with the delay on 
every iteration which computing $q_i$ may cause. Apart from the extra storage cost 
of the scaled number, scaling $M$ has a similar cost, namely one more iteration 
per multiplication: $M$ is replaced by $(r - m_0^{-1})M$, which increases its number of 
digits by 1 [26]. However, at the end of the exponentiation, the original $M$ also 
needs to be loaded into the hardware and some extra subtractions of $M$ may be 
necessary to reduce the output from a bound of $2rM$. 
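
Both simplifications can be combined in a short sketch, in which the quotient digit is simply the lowest digit of the running value. The function name and test values are assumptions; bounds on $R$ are not enforced, only the congruence.

```python
def mont_mul_simple_q(A, B, M, r, n):
    """Montgomery loop with scaled M and shifted B, so that q_i = r_0."""
    m0_inv = pow(M % r, -1, r)
    M_scaled = (r - m0_inv) * M            # lowest digit is r-1, i.e. -1 mod r
    B_shifted = r * B                      # lowest digit is 0
    assert M_scaled % r == r - 1 and B_shifted % r == 0
    R = 0
    for i in range(n):
        q_i = R % r                        # read off directly, no multiplication
        a_i = (A // r**i) % r
        total = R + a_i * B_shifted + q_i * M_scaled
        assert total % r == 0              # division by r is still exact
        R = total // r
    return R

A, B, M, r, n = 0xBEEF, 0xCAFE, 0xF123, 16, 4
R = mont_mul_simple_q(A, B, M, r, n)
```

Because $B$ was shifted up by one digit, the result carries the factor $r^{1-n}$ rather than $r^{-n}$, which is the balancing adjustment the text absorbs into $R_2$.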



6 Digit-Parallel and Digit-Serial Implementations 



To overcome the problems of carry propagation in the classical algorithm, re- 
dundancy and extra hardware for digit broadcasting were required. Here, too, 
the same methods enable parallel digit processing. Indeed, if the shift direction 
were reversed, the diagram of Fig. 2 would cover Montgomery’s algorithm also. 

For both algorithms, define the $i$th value of $R$, written $R^{(i)} = \sum_j r_j^{(i)} r^j$, 
to be the value immediately before the shift is performed. This is calculated at 
time $2i$ in the parallel digit implementations, with $q_i$ computed and broadcast 
at time $2i+1$, say. The $j$th cell operates on the $j$th digits, transforming the $i-1$st 
value of $R$ into the $i$th value. A common view of this process is: 

$r_j^{(i)} \leftarrow r_{j\pm1,s}^{(i-1)} + r_{j\pm1-1,c}^{(i-1)} + a_i b_j \pm q_i m_j$ 

where the choice of signs is $-$ for the classical algorithm and $+$ for Montgomery’s. 
(The input values of $R$ on the right are partitioned into save and carry/borrow 
parts.) 

A restriction to only nearest neighbour communication is desirable because 
of the delays and wiring associated with global movement of data. For Mont- 
gomery’s algorithm, a systolic array makes this possible [26]. In this, the cells 
are transformed into a pipeline in which the $j$th cell computes $r_j^{(i)}$ at time $2i+j$ 
(Fig. 3). The input $r_{j+1}^{(i-1)}$ is calculated on the preceding cycle by cell $j+1$ and 
a carry from $r_{j-1}^{(i)}$ is computed in cell $j-1$, also in that cycle. This means 
that carries can be propagated and so the cell function can become: 

$r_j^{(i)} + r c_j^{(i)} \leftarrow r_{j+1}^{(i-1)} + c_{j-1}^{(i)} + a_i b_j + q_i m_j$ 

where the digits $r_j^{(i)}$ are now in the standard, non-redundant range $0 .. r-1$. (The 
different notation for the carries recognises that they do not form part of the 
value of $R^{(i)}$, unlike in the carry-save view.) If the digit $q_i$ is produced at time $2i$ 
in cell 0, it can be pipelined and received from cell $j-1$ at time $2i+j-1$ ready for 
the calculation. This pipeline can be extended to part or all of a 2-dimensional 
array with $n$ rows which computes iterations of the loop in successive rows. 




Figure 3. Pipelined Montgomery Multiplication: I/O for cell $j$ at time $2i+j$. 

The lowest cell, cell 0, computes only the quotient digit qi. Digits of index 0 
are always discarded by the shift down and so do not need computing; the lowest 
digit of the final output is shifted down from index 1. We have 
$q_i = -(r_1^{(i-1)} + a_i \times b_0)\, m_0^{-1} \bmod r$. This can indeed be calculated at time $2i$ because its only 
timed input $r_1^{(i-1)}$ is computed at time $2i-1$. Observe that pre-computation 
of $b_0 \times m_0^{-1}$ reduces the computation of $q_i$ to a single digit multiplication and 
an addition, giving lower complexity for cell 0 than for the other cells. Hence, 
computing $q_i$ no longer holds up the multiplication. Instead, the critical path 
lies in the repeated, standard cell. 
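
The cell function can be sketched at digit level. This is a sequential simulation of the linear array, not a cycle-accurate model: for each $i$ the cells are visited in increasing $j$, an order that respects the same data dependencies as the schedule at time $2i+j$. It assumes $B, M < r^n$; the names and the array width are choices of the sketch.

```python
def digits(x, r, count):
    return [(x // r**j) % r for j in range(count)]

def systolic_montgomery(A, B, M, r, n):
    """Digit-level Montgomery multiplication using the pipelined cell function."""
    width = n + 2                          # digit positions 0 .. n+1
    a = digits(A, r, n)
    b = digits(B, r, width)
    m = digits(M, r, width)
    m0_inv = pow(m[0], -1, r)
    prev = [0] * (width + 1)               # r_j^(i-1), plus one zero guard digit
    for i in range(n):
        # Cell 0 produces only q_i; digit 0 of R^(i) is always zero.
        low = prev[1] + a[i] * b[0]
        q = (-low * m0_inv) % r
        carry = (low + q * m[0]) // r
        cur = [0] * (width + 1)
        # Cells 1 .. n+1 are the repeated standard cell.
        for j in range(1, width):
            t = prev[j + 1] + carry + a[i] * b[j] + q * m[j]
            cur[j], carry = t % r, t // r
        assert carry == 0                  # the value fits in the array
        prev = cur
    # The output is shifted down from index 1.
    return sum(prev[j] * r**(j - 1) for j in range(1, width))

A, B, M, r, n = 0xBEEF, 0xCAFE, 0xF123, 16, 4
R = systolic_montgomery(A, B, M, r, n)
```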

The communication infrastructure is less here than for the parallel digit ope- 
rations illustrated in Fig. 2. Although the number of bits transmitted is almost 
the same in both cases and is independent of n, the parallel digit set-up requires 
an additional $O(\log n)$ depth network of multiplexers to distribute the digit $q_i$. 
Here the inputs and output are consumed, resp. generated, at a rate of one digit 
every other cycle for A and and one digit every cycle in the case of B. Unlike 
the parallel digit model, this is very convenient for external communication over 
the bus, reducing the need for buffering or increased bandwidth. 

When one multiplication has completed, another one can start without any 
pause. However, the opposing directions of carry propagation and shift mean 
that each cell is idle on alternate cycles. Thus, full use of the hardware requires 
two modular multiplications to be processed in parallel. The normal square and 
multiply algorithm for exponentiation can be programmed to compute squares 
nose-to-tail starting loop iterations on the even cycles and interleave any ne- 
cessary multiplications to start on the odd cycles. This enables an average 75% 
take-up of the processing power, but has some overhead in storage and switching 
between the two concurrent multiplications. Overall, with this added complexity, 
the classical, parallel digit, linear array might be faster for small n, but for lar- 
ger n and/or smaller r the broadcasting problem for qi means that a pipelined 
implementation of Montgomery’s algorithm should be faster. 

In [14] Peter Kornerup modified this arrangement, pairing or grouping digits 
in order to reduce the waiting gap. In effect he alters the angle of the timing 
front in the data dependency graph and, in the case he illustrates, he uses half 
the number of cells with twice the computing power. This can be advantageous 
in some circumstances. 

An idea of the current speed of such array implementations is given in Blum 
and Paar [1] and amongst those actually constructed is one by Vuillemin et al. 
[22]. 

7 Data Integrity 

Correct functioning is important not only for critical applications but as a protec- 
tion against, for example, attacks on RSA signature schemes through single fault 
analysis [2]. Moreover, it is difficult and expensive to check all gate combinations 
for faults at fabrication time because of the time needed to load sufficiently many 
different moduli [29] and, when smart cards are involved, a low unit price may 
only be possible by using tests which occasionally allow sub-standard products 
through to the market. 

However, run-time checker functions are possible. These can operate in a si- 
milar way to those for multiplication in current chips. For example, results there 
are checked mod 3 and mod 5 in one case [8]. Here the cost of a similar check 
is minimal compared to that of the total hardware. The key observation is that 
the output from the modular multiplication algorithm satisfies an arithmetic 
equation: 



$R = A \times B - Q \times M$ or $R r^n = A \times B + Q \times M$ 

These are easily checked mod $m$ for some suitable $m$ by accumulating partial 
results for both sides on a digit-by-digit basis as the digits become available. A 
particularly good choice for m is a prime just above the maximum cell output 
value, but smaller m prime to r are also reasonable. The hardware complexity 
for this is then equivalent to about one cell in the linear array and so the cost is 
close to that of increasing n by 1. 
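
A checker of this kind might be sketched as follows for Montgomery's algorithm. The check modulus $m = 251$, the `fault_at` parameter and the single-bit fault model are assumptions made purely for illustration.

```python
def checked_montgomery(A, B, M, r, n, m, fault_at=None):
    """Montgomery multiplication with a run-time check of R*r^n = A*B + Q*M mod m."""
    m0_inv = pow(M % r, -1, r)
    R = 0
    a_mod = q_mod = 0                      # running residues of A and Q mod m
    weight = 1                             # r^i mod m
    for i in range(n):
        a_i = (A // r**i) % r
        R += a_i * B
        q_i = (-(R % r) * m0_inv) % r
        R = (R + q_i * M) // r
        if i == fault_at:
            R ^= 1                         # simulated single-bit hardware fault
        a_mod = (a_mod + a_i * weight) % m # accumulate as digits become available
        q_mod = (q_mod + q_i * weight) % m
        weight = (weight * r) % m
    # After the loop, weight equals r^n mod m.
    ok = (R * weight) % m == (a_mod * (B % m) + q_mod * (M % m)) % m
    return R, ok

args = (0xBEEF, 0xCAFE, 0xF123, 16, 4, 251)
_, clean_ok = checked_montgomery(*args)
_, faulty_ok = checked_montgomery(*args, fault_at=1)
```

A fault of $\pm 1$ in $R$ after iteration $i$ perturbs the identity by $\pm r^{i+1}$, which is non-zero mod $m$ whenever $m$ is prime to $r$, so this particular fault is always caught.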

If a discrepancy is found by the checker function, the computation can be 
aborted or re-computed by a different route. For example, to avoid the problem, 
$M$ might be replaced by $dM$ for a digit $d$ prime to $r$ and combined as necessary 
with some extra subtractions of the original M at the end. 

8 Timing and Power Attacks 

The literature contains descriptions of a number of attacks on the RSA crypto- 
system which use timing or power information from hardware implementations 
which contain secret keys [12], [13]. Experience of implementing both the classical 
and Montgomery algorithms for modular multiplication suggests that most opti- 
misation techniques which work with one also apply to the other. This suggests 
that attacks which succeed on well-designed implementations of the classical al- 
gorithm will have equivalents which apply to implementations of Montgomery’s 
algorithm. 

However, an important difference arises when the pipelined linear array is 
used since, judging from the data dependency graph, there seems to be no 
equivalent for the classical algorithm. With parallel digit processing of the 
multiplication $A \times B \bmod M$, the same digits of $A$ and $Q$ are used in every digit slice, 
opening up the possibility of extracting information about both by averaging 
the power consumption over all cells. However, during the related Montgomery 
multiplication in a pipelined array, many digits of A and Q are being used si- 
multaneously for forming digit products. This should make identification of the 
individual digits much more difficult, and certainly increases the difficulty of any 
analysis. 



9 Conclusion 

We have reviewed and compared the main bottlenecks which may arise in hard- 
ware for implementing the RSA cryptosystem using both the classical algorithm 
for modular multiplication and Montgomery’s version, and shown how these are 
solved. The hardware still suffers from broadcasting problems with the classical 
algorithm and scheduling complications with Montgomery’s. However, as far as 
implementation attacks using power analysis are concerned, the pipelined array 
for the latter seems to offer considerable advantages over any other
implementations. 

Abstract. This paper describes the methodology and design of a scalable
Montgomery multiplication module. There is no limitation on the 
maximum number of bits manipulated by the multiplier, and the sel- 
ection of the word-size is made according to the available area and/or 
desired performance. We describe the general view of the new architec- 
ture, analyze hardware organization for its parallel computation, and 
discuss design tradeoffs which are useful to identify the best hardware 
configuration. 



1 Introduction 

The Montgomery multiplication algorithm [10] is an efficient method for modular 
multiplication with an arbitrary modulus, particularly suitable for implementa- 
tion on general-purpose computers (signal processors or microprocessors). The 
method is based on an ingenious representation of the residue class modulo M, 
and replaces division by M operation with division by a power of 2. This opera- 
tion is easily accomplished on a computer since the numbers are represented in 
binary form. Various algorithms [11,7,1] attempt to modify the original method 
in order to obtain more efficient software implementations on specific processors 
or arithmetic coprocessors, or direct hardware implementations. In this paper we 
are interested in hardware implementations of the Montgomery multiplication 
operation. 

Several algorithms and hardware implementations of the Montgomery multi- 
plication for a limited precision of the operands were proposed [1,11,3]. In order 
to get improved performance, high-radix algorithms have also been proposed [11,8].
However, these high-radix algorithms are usually more complex and consume 
significant amounts of chip area, and it is not so evident whether the complex 
circuits derived from them provide the desired speed increase. A theoretical in- 
vestigation of the design tradeoffs for high-radix modular multipliers is given in 
[15]. An example of a design in radix-4 is shown in [13]. The increase in the radix 
forces the use of digit multipliers, and therefore more complex designs and longer 
clock cycle times. For this reason, low-radix designs are usually more attractive 
for hardware implementation. 

The Montgomery multiplication is the basic building block for the modular 
exponentiation operation [4,5] which is required in the Diffie-Hellman and RSA 
public-key cryptosystems [2,12]. Currently, most modular exponentiation chips 
perform the exponentiation as a series of modular multiplications, which is the 
most compelling reason for the research of fast and inexpensive modular mul- 
tipliers for long integers. Recent implementations of the Montgomery multipli- 
cations are focused on elliptic key cryptography [9] over the finite field GF(p). 
The introduction [6] of the Montgomery multiplication in $GF(2^k)$ opened up 
new possibilities, most notably in elliptic key cryptography over the finite field 
$GF(2^k)$ and discrete exponentiation over $GF(2^k)$ [5]. 

In this paper, we propose a scalable Montgomery multiplication architecture, 
which allows us to investigate different areas of the design space, and thus, 
analyze the design tradeoffs for the best performance in a limited chip area. We 
start with a short discussion of the scalability requirement which we impose in 
our design, and then give a presentation of the general theoretical issues related 
to the Montgomery multiplication. We then propose a word-based algorithm, 
and show the parallel evaluation of its steps in detail. Using this analysis, we 
derive an architecture for the modular multiplier and present the design of the 
module. We also perform simulations in order to provide area/time tradeoffs and 
give a first order evaluation of the multiplier performance for various operand 
precision. 



2 Scalability 

We consider an arithmetic unit as scalable if 

the unit can be reused or replicated in order to generate long-precision 
results independently of the data path precision for which the unit was 
originally designed. 

For example, a multiplier designed for 768 bits [13] cannot be immediately used 
in a system which needs 1,024 bits. The functions performed by such designs 
are not consistent with the ones required in the larger precision system, and 
the multiplier must be redesigned. In order to make the hardware scalable, the 
usual solution is to use software and standard digit multipliers. The algorithms 
for software computation of Montgomery’s multiplication are presented in [6,7]. 
The complexity of software-oriented algorithms is much higher than the comple- 
xity of the radix-2 hardware implementation [1], making a direct hardware im- 
plementation not attractive. In the following, we propose a hardware algorithm 
and design approach for the Montgomery multiplication that are attractive in 
terms of performance and scalability. 

3 Montgomery Multiplication 

Given two integers X and Y, the application of the radix-2 Montgomery
multiplication (MM) algorithm with required parameters for n bits of precision will 
result in the number 

    $Z = MM(X, Y) = X Y r^{-1} \bmod M$ ,    (1) 

where $r = 2^n$ and M is an integer in the range $2^{n-1} < M < 2^n$ such that 
$\gcd(r, M) = 1$. Since $r = 2^n$, it is sufficient that the modulus M is an odd integer. 
For cryptographic applications, M is usually a prime number or the product of 
two primes, and thus, the relative primality condition is always satisfied. The 
Montgomery algorithm transforms an integer in the range $[0, M-1]$ to another 
integer in the same range, which is called the image or the M-residue of the 
integer. The image or the M-residue of a is defined as $\bar{a} = a r \bmod M$. It is easy 
to show that the Montgomery multiplication over the images $\bar{a}$ and $\bar{b}$ computes 
the image $\bar{c} = MM(\bar{a}, \bar{b})$, which corresponds to the integer $c = ab \bmod M$ [7]. The 
transformation between the image and the integer set is accomplished using the 
MM as follows. 

- From the integer value to the M-residue: $\bar{a} = MM(a, r^2) = a r^2 r^{-1} \bmod M = a r \bmod M$. 

- From the M-residue to the integer value: $a = MM(\bar{a}, 1) = a r r^{-1} \bmod M = a \bmod M$. 

Provided that r (mod M) and $r^2$ (mod M) are precomputed and saved, we 
need only a single MM to perform either of these transformations. The tradeoff 
is the lower complexity of the MM algorithm when compared to the conventional 
modular multiplication which requires a division operation. Another important 
aspect of the advantage of the MM over the conventional multiplication is ex- 
posed in modular exponentiation, when multiple MMs are computed over the 
M -residues before the result is translated back to the original integer set. 
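
The transformations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `mm`, `to_residue`, `from_residue`, and `mod_exp` are hypothetical, and `mm` evaluates definition (1) by n exact halvings modulo M rather than by any particular hardware algorithm.

```python
def mm(x, y, M, n):
    """Montgomery product x*y*r^-1 mod M with r = 2^n (M must be odd)."""
    t = x * y
    for _ in range(n):
        if t & 1:          # make t even without changing it modulo M
            t += M
        t >>= 1            # exact division by 2
    return t % M

def to_residue(a, M, n):
    """Integer -> M-residue with a single MM, using r^2 mod M."""
    r2 = pow(1 << n, 2, M)  # assumed to be precomputed and stored in practice
    return mm(a, r2, M, n)

def from_residue(a_bar, M, n):
    """M-residue -> integer with a single MM by the constant 1."""
    return mm(a_bar, 1, M, n)

def mod_exp(a, exponent, M, n):
    """Binary exponentiation carried out entirely on M-residues."""
    a_bar = to_residue(a, M, n)
    acc = to_residue(1, M, n)          # image of 1, i.e. r mod M
    for bit in bin(exponent)[2:]:      # scan exponent bits, most significant first
        acc = mm(acc, acc, M, n)
        if bit == "1":
            acc = mm(acc, a_bar, M, n)
    return from_residue(acc, M, n)
```

Only the first and last steps of `mod_exp` leave the residue domain, which is why the conversion cost is amortized over the many multiplications of an exponentiation.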

The radix-2 Montgomery multiplication algorithm for m-bit operands 
$X = (x_{m-1}, \ldots, x_1, x_0)$, Y, and M is given as: 

The Radix-2 Algorithm 

    for i = 0 to m - 1 
        if $(S_i + x_i Y)$ is even 
            then $S_{i+1} := (S_i + x_i Y)/2$ 
            else $S_{i+1} := (S_i + x_i Y + M)/2$ 
    if $S_m \ge M$ then $S_m := S_m - M$      -- the final correction step 

This algorithm is adequate for hardware implementation because it is composed 
of simple operations: word-by-bit multiplication, bit-shift (division by 2), and 
addition. The test of the even condition is also very simple to implement,
consisting of checking the least significant bit of the partial sum $S_i$ to decide if the 
addition of M is required. However, the operations are performed on the full
precision of the operands, and in this sense, they will have an intrinsic limitation on 
the operands’ precision. Once the hardware is defined for m bits, it cannot work 
with more bits. 
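
A direct software rendering of the radix-2 algorithm may help make the data flow concrete. It is a minimal sketch, assuming $S_0 = 0$, an odd modulus M, and operands X, Y < M; the function name `radix2_mm` is hypothetical, and Python's unbounded integers stand in for the full-precision hardware adder.

```python
def radix2_mm(X, Y, M, m):
    """Return X*Y*2^(-m) mod M, scanning the m bits of X least significant first."""
    S = 0                      # assumed initial partial sum
    for i in range(m):
        x_i = (X >> i) & 1     # i-th bit of the multiplier
        S += x_i * Y           # word-by-bit multiplication and addition
        if S & 1:              # odd sum: add the odd modulus to make it even
            S += M
        S >>= 1                # exact division by 2 (one-bit right shift)
    if S >= M:                 # final correction step
        S -= M
    return S
```

Every operation inside the loop touches all bits of S, Y, and M at once, which is the full-precision limitation noted above.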

4 A Multiple-Word Radix-2 Montgomery Multiplication 
Algorithm 

The use of short precision words reduces the broadcast problem in the circuit 
implementation. The broadcast problem corresponds to the increase in the
propagation delay of high-fanout signals. Also, a word-oriented algorithm provides 
the support we need to develop scalable hardware units for the MM. Therefore, 
an algorithm which performs bit-level computations and produces word-level
outputs would be the best choice. Let us consider w-bit words. For operands with 
m bits of precision, $e = \lceil (m+1)/w \rceil$ words are required. The extra bit used in 
the calculation of e is required since it is known that S (internal variable of the 
radix-2 algorithm) is in the range $[0, 2M-1]$, where M is the modulus. Thus the 
computations must be done with an extra bit of precision. The input operands 
will need an extra 0 bit value at the leftmost bit position in order to have the 
precision extended to the correct value. 

We propose an algorithm in which the operand Y (multiplicand) is scanned
word-by-word, and the operand X (multiplier) is scanned bit-by-bit. This 
decision enables us to obtain an efficient hardware implementation. We call it the 
Multiple Word Radix-2 Montgomery Multiplication algorithm (MWR2MM). We 
make use of the following vectors: 

    $M = (M^{(e-1)}, \ldots, M^{(1)}, M^{(0)})$ , 
    $Y = (Y^{(e-1)}, \ldots, Y^{(1)}, Y^{(0)})$ , 
    $X = (x_{m-1}, \ldots, x_1, x_0)$ , 

where the words are marked with superscripts and the bits are marked with
subscripts. The concatenation of vectors a and b is represented as (a, b). A particular 
range of bits in a vector a from position i to position j, j > i, is represented as 
$a_{j..i}$. The bit position i of the k-th word of a is represented as $a^{(k)}_i$. The details 
of the MWR2MM algorithm are given below. 

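The MWR2MM Algorithm 

    S := 0      -- initialize all words of S 
    for i = 0 to m - 1 
        $(C, S^{(0)}) := x_i Y^{(0)} + S^{(0)}$ 
        if $S^{(0)}_0 = 1$ then 
            $(C, S^{(0)}) := (C, S^{(0)}) + M^{(0)}$ 
            for j = 1 to e - 1 
                $(C, S^{(j)}) := C + x_i Y^{(j)} + M^{(j)} + S^{(j)}$ 
                $S^{(j-1)} := (S^{(j)}_0, S^{(j-1)}_{w-1..1})$ 
            $S^{(e-1)} := (C, S^{(e-1)}_{w-1..1})$ 
        else 
            for j = 1 to e - 1 
                $(C, S^{(j)}) := C + x_i Y^{(j)} + S^{(j)}$ 
                $S^{(j-1)} := (S^{(j)}_0, S^{(j-1)}_{w-1..1})$ 
            $S^{(e-1)} := (C, S^{(e-1)}_{w-1..1})$ 

A.F. Tenca and Ç.K. Koç 
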

The MWR2MM algorithm computes a partial sum S for each bit of X, scanning 
the words of Y and M. Once the precision is exhausted, another bit of X is 
taken, and the scan is repeated. Thus, the algorithm imposes no constraints on 
the precision of operands. The arithmetic operations are performed in precision 
w bits, and they are independent of the precision of operands. What varies is 
the number of loop iterations required to accomplish the modular multiplication. 
The carry variable C must be in the set {0, 1, 2}. This condition is imposed by the 
addition of the three vectors S, M, and $x_i Y$. To have containment in the addition 
of 3 w-bit words and a maximum carry value $C_{max}$ (generated by previous word 
addition), the following equation must hold: 

    $3(2^w - 1) + C_{max} \le C_{max} 2^w + 2^w - 1$ , 

which results in $C_{max} \ge 2$. Thus, choosing $C_{max} = 2$ is enough to satisfy the 
containment condition. The carry variable C is represented by two bits. 
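
The word-serial scan can be mimicked in software as follows. This is a hedged sketch rather than the authors' implementation: the function name `mwr2mm` is hypothetical, operands are assumed to satisfy X, Y < M with M odd, the two branches of the algorithm are merged through a flag `add_m`, and the final correction step is added only so that the result can be compared with $XY2^{-m} \bmod M$.

```python
def mwr2mm(X, Y, M, m, w):
    """Word-serial radix-2 Montgomery multiplication with w-bit words."""
    e = -(-(m + 1) // w)                       # e = ceil((m + 1) / w) words
    mask = (1 << w) - 1
    Yw = [(Y >> (w * j)) & mask for j in range(e)]
    Mw = [(M >> (w * j)) & mask for j in range(e)]
    S = [0] * e                                # initialize all words of S
    for i in range(m):
        x_i = (X >> i) & 1
        t = x_i * Yw[0] + S[0]
        C, S[0] = t >> w, t & mask
        add_m = S[0] & 1                       # parity test, kept for the whole scan
        if add_m:
            t = ((C << w) | S[0]) + Mw[0]
            C, S[0] = t >> w, t & mask
        for j in range(1, e):
            t = C + x_i * Yw[j] + add_m * Mw[j] + S[j]
            C, S[j] = t >> w, t & mask
            assert C <= 2                      # containment: two carry bits suffice
            # one-bit right shift across the word boundary
            S[j - 1] = ((S[j] & 1) << (w - 1)) | (S[j - 1] >> 1)
        S[e - 1] = (C << (w - 1)) | (S[e - 1] >> 1)
    result = sum(word << (w * j) for j, word in enumerate(S))
    if result >= M:                            # correction step (not part of the scan)
        result -= M
    return result
```

Changing `w` changes only the number of inner-loop iterations, not the width of any arithmetic operation, which is the scalability property the algorithm is designed for.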

5 Parallel Computation of the MWR2MM 

In this section we analyze the data dependencies in the proposed algorithm 
(MWR2MM), giving more information on its potential parallelism and
investigating parallel organizations suitable for its implementation. 

The dependency between operations within the loop for j restricts their
parallel execution due to the dependency on the carry C. However, parallelism is 
possible among instructions in different i loops. The dependency graph for the 
MWR2MM algorithm is shown in Figure 1. Each circle in the graph represents 
an atomic computation and is labeled according to the type of action performed. 
Task A corresponds to three steps: (1) test the least significant bit of S to
determine if M should be added to S during this and next steps, (2) addition of 
words from S, $x_i Y$, and M (depending on the test performed), and (3) one-bit 
right shift of an S word. Task B corresponds to steps (2) and (3). We observe 
from this graph that the degree of parallelism and pipelining can be very high. 

Each column in the graph may be computed by a separate processing element 
(PE), and the data generated from one PE may be passed to another PE in a 
pipelined fashion. An example of the computation executed for 5-bit operands 
is shown in Figure 2 for the word size of w = 1 bit. Since the j-th word of each 
input operand is used to compute word j - 1 of S, the last B task in each column 
must receive $M^{(e)} = Y^{(e)} = 0$ as inputs. This condition is enough to guarantee 
that $S^{(e-1)}$ will be generated based only on the internal PE information. Note 
also that there is a delay of 2 clock cycles between processing a column for $x_i$ 
and a column for $x_{i+1}$. The total execution time for the computation shown in 
Figure 2 is 14 clock cycles. 

Fig. 1. The dependency graph for the MWR2MM Algorithm. 

Tasks A and B are performed on the same hardware module. The local control 
circuit of the module must be able to read the least significant bit of $S^{(0)}$ at the 
beginning of the operation, and keep this value for the entire operand scanning. 
Recall that the even condition of $S^{(0)}$ determines if the processing unit should 
add M to the partial sum during the pipeline cycle. The pipeline cycle is the 
sequence of steps that a PE needs to execute to process all words of the input 
operands. 

The maximum degree of parallelism that can be attained with this
organization is found as 

    $P_{max} = \left\lceil \frac{e+1}{2} \right\rceil$ .    (2) 

Fig. 2. An example of computation for 5-bit operands, where w = 1 bit. 

It is easy to see from Figure 2 that $P_{max} = 3$. When fewer than $P_{max}$ units are 
available, the total execution time will increase, but it is still possible to perform 
the full precision computation with the smaller circuit. Figure 3 shows what 
happens when only 2 processing modules are used for the same computation 
shown in Figure 2. In this case, the computation during the last pipeline cycle 
wastes one of the stages, because m is not a multiple of 2. 

The total computation time T (in clock cycles) when $n \le P_{max}$ modules 
(stages) are used in the pipeline is 

    $T = \left\lceil \frac{m+1}{n} \right\rceil (e+1) - 1 + 2(n-1)$ ,    (3) 

where the first term corresponds to the number of pipeline cycles ($\lceil (m+1)/n \rceil$) 
times the number of clock cycles required by a pipeline stage to compute one
full-precision operand, and the last term corresponds to the latency of the pipeline 
architecture. With n units, the average utilization of each unit is found as 

    $U = \frac{\text{Total number of time slots per bit of } X \times m}{\text{Total number of time slots} \times n} = \frac{m(e+1)}{T n}$ .    (4) 
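
Equations (2) to (4) are easy to evaluate numerically. The sketch below uses hypothetical helper names (`ceil_div`, `num_words`, `p_max`, `total_cycles`, `utilization`) and assumes the reading of Equation (3) given above; for m = 1024, n = 10, and w = 36 it yields the 3107 clock cycles that the paper quotes in its area/time discussion.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)

def num_words(m, w):
    """e = ceil((m + 1) / w): number of w-bit words for m-bit operands."""
    return ceil_div(m + 1, w)

def p_max(e):
    """Equation (2): maximum useful number of processing elements."""
    return ceil_div(e + 1, 2)

def total_cycles(m, w, n):
    """Equation (3): clock cycles with n pipeline stages (n <= P_max assumed)."""
    e = num_words(m, w)
    return ceil_div(m + 1, n) * (e + 1) - 1 + 2 * (n - 1)

def utilization(m, w, n):
    """Equation (4): average utilization of each of the n units."""
    e = num_words(m, w)
    return m * (e + 1) / (total_cycles(m, w, n) * n)
```

For example, `total_cycles(1024, 36, 10)` evaluates the 10-stage, 36-bit-word configuration discussed later in the paper.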

Figure 4 shows the hardware utilization U, the total computation time, and the
speedup of 2 or 3 units over one unit, for a small range of the precision and word size 
w = 8 bits. We can see that the overhead of the pipelined organization becomes 
insignificant for precision m > 3w. We can attain a speedup very close to the 
optimum even for a small number of operand bits. 

Fig. 3. An example of computation for 5-bit operands with two pipeline stages. 



6 Design of the Scalable Architecture 

A pipeline with 2 computational units is shown in Figure 5. One aspect of this 
organization is the register file design. Since the data is received word-serially 
by the kernel, the registers must work as rotators (circular shift registers) in 
some cases and shifters in other cases. The registers which store Y and M work 
as rotators. The processing element itself must relay the received digits to the 
next unit in the pipeline. All paths are w bits wide, except for the $x_i$ inputs 
(only 1 bit). The values of $x_i$ come from a p-shift register, where p equals the 
number of processing elements in the pipeline. The register for S must be a shift 
register, since its contents are not reused. The length (L) of the shift register for 
S values depends on the number of words (e) and the number of stages (n) in 
the pipeline. This length is determined as: 

    $L = \begin{cases} e + 2 - 2n & \text{if } (e+2) > 2n \\ 0 & \text{otherwise} \end{cases}$    (5) 

Observe that these registers will not consume more than what is normally 
used in a conventional radix-2 design of the MM. These registers can be easily 
implemented by connecting one memory element to another in a chain (or loop), 
which will not impact the clock cycle of the whole system. Since we also need 
loading capability for the rotator, multiplexers (MUXes) should be used between 
certain memory elements in the chain. The delay imposed by these MUXes will 
not create a critical path in the final circuit. To avoid having too many MUXes, 
we may load M and Y serially, during the last pipeline cycle of the algorithm. 
In this case, MUXes are required between two memory elements of the rotator 
only (not between all of the memory elements). 

[Figure 4 panels: Total Execution Time, Module Utilization, and Speedup over a single PE, each plotted against Operand Precision (bits).] 

Fig. 4. The performance figures for multiple units with w = 8 bits. 

The global control block was not included in this figure for simplicity. Its 
function can be inferred from the dependency graph and the algorithm already 
presented. The shaded boxes represent flip-flops. 

Fig. 5. Pipelined organization with 2 units. 

6.1 Processing Element 

The block diagram of the processing element is shown in Figure 6. The data 
path receives the inputs from the previous stage in the pipeline, and computes 
the next digits serially. The inputs are delayed one extra clock cycle before 
they are sent to the next stage. 




Fig. 6. The block diagram of the processing unit. 



To reduce storage and arithmetic hardware complexity, we consider that M, 
X, and Y are available in non-redundant form. The internal sum S is received and 
generated in the redundant Carry-Save form. In this case, 2w bits per word are 
transferred between units in each clock cycle. The data path also makes available 
the information on the least significant bit (t) of the computation $S^{(0)} + x_i Y^{(0)}$, 
which is the first computation step performed by the data path in each pipeline 
cycle. Only the value t obtained when the least significant digits of Y and S 
come into the unit should be used to control the addition of M (control signal 
c). The local control is responsible for storing the t value during the pipeline 
cycle, and also for relaying some control signals to the downstream modules. 

The design of the data path follows the idea presented in [14] modified for 
least-significant-digit-first type of computation. The basic organization of the 
data path consists of two layers of carry-save adders (CSA). Assuming a
full-precision structure as in Figure 7(a), we propose the retiming process shown for 
the case w = 1 to generate the serial circuit design presented in Figure 7(b). 
For w > 1, larger groups of adders are considered, based on the same approach. 
Notice that the cycle time may increase for larger w as a result of the broadcast 
problem only; it will not depend on the arithmetic operation itself. The
high-fanout signals in the design are $x_i$ and c, and both change value only once for 
each pipeline cycle. Observe that the bit-right-shift that must be performed by 
the data path is already included in the CSA structure shown in the figure. 

(a) full-precision adder structure    (b) radix-2 serial adder structure 

Fig. 7. The serial computation of the MM operations. 

The data path design for the case w = 3 is shown in Figure 8. It has a more 
complicated shift and alignment section to generate the next S word. When 
computing the bits of word j (step j), the circuit generates w - 1 bits of $S^{(j)}$ 
and the most significant bit of $S^{(j-1)}$. The bits of $S^{(j-1)}$ computed at step j - 1 
must be delayed and concatenated with the most significant bit generated at 
step j (alignment). 

Fig. 8. PE’s data path for w = 3 bits. 

7 Area/Time Tradeoffs 

After describing the general building block for the implementation of our scalable 
MM architecture, we discuss the area/time tradeoffs that arise for different values 
of operand precision m, word size w, and the pipeline organization. The area A 
is given as a design constraint. In this analysis, we do not consider the wiring 
area. For a first order approximation we consider that the propagation delay of 
the processing element is independent of w (this hypothesis is reasonable when 
w is small). This assumption implies that the clock cycle is the same for all 
cases and the comparison of speed among different designs can be made based 
on clock cycles. The area used by registers for the intermediate sum, operands 
and modulus is the same for all designs. 

It is clear that the proposed scheme has the worst execution time for the case 
w = m, since some extra cycles were introduced by the computational unit in 
order to allow word-serial computation, when compared to other full-precision 
designs. Thus, we will consider the case when the available chip area is not 
sufficient to implement a full-precision conventional design. The performance 
evaluation reduces to the question: 

What is the best organization for the scalable architecture for a given 
area? 

We used VHDL on the Mentor Graphics tools to synthesize the circuit with the 
1.2 μm CMOS technology. The cell area for a given word size w is obtained as 

    $A_{cell}(w) = 47.2 w$ , 

where the value 47.2 is the area cost provided by the tool (a 2-input NAND gate 
corresponds to 0.94). When using the pipelined organization, the area of each 
inter-stage latch is important, and was measured as $A_{latch}(w) = 8.32 w$. The area 
of a pipeline with n units is given as 

    $A_{pipe}(n, w) = n A_{cell}(w) + (n-1) A_{latch}(w) = 55.52 n w - 8.32 w$ .    (6) 

The maximum word size that can be used in the particular design ($w_{max}$) is a 
function of the available area A and the number of pipeline stages n. It is found 
as 

    $55.52 n w - 8.32 w \le A$ 
    $w \le \frac{A}{55.52 n - 8.32}$ 
    $w_{max}(A, n) = \left\lfloor \frac{A}{55.52 n - 8.32} \right\rfloor$ .    (7) 



Based on $w_{max}$, we obtain the total execution time (in clock cycles) for operands 
with precision m from Equation 3, as follows: 

    $T(m, A, n) = \left\lceil \frac{m+1}{n} \right\rceil \left( \left\lceil \frac{m+1}{w_{max}(A, n)} \right\rceil + 1 \right) - 1 + 2(n-1)$ .    (8) 

For a given area A, we are able to try different organizations and select the 
faster one. The graph given in Figure 9 shows the computation time for various 
pipeline configurations for A = 20,000. The number of stages that provides the 
best performance varies with the precision required in the computation. For the 
cases shown, 5 stages would provide good performance. We don’t want to have 
too many stages for two reasons: (1) high utilization of the processing elements 
will be possible only for very high precision and (2) the execution time may 
have undesirable oscillations (as shown in the rightmost part of the curve for 
m = 1024). The behavior mentioned in (2) is the result of (i) word size w is not 
a good divisor for m, producing one word (most significant) with few significant 
bits, and (ii) there is not a good match between the number of words e and n, 
causing a sub-utilization of stages in the pipeline. 



Execution time for an Area of 20,000 m= 1024,512,256 




Fig. 9. The execution time of the MM hardware for various precision and configurati- 
ons. 



For a fixed area, the word size becomes a function of the number of stages 
only. The word size decreases as the number of stages in the pipeline increases. 
The word size for some values of n is given in Table 1. 



Table 1. The number of pipeline stages versus the word size, for a fixed chip area. 

n (stages)   1    2    3    4    5    6    7    8    9    10 
w (bits)     423  194  126  93   74   61   52   45   40   36 

From the synthesis tools we also obtained a minimum clock cycle time of 
11 ns (clock frequency of 90 MHz). For the case m = 1024 bits, n = 10 stages, 
and w = 36 bits, the total execution time is 3107 × 11 = 34,177 nanoseconds. 

The correction step was not included in these estimates, but it would require 
another pipeline cycle to be performed. 
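
The area model and the figures quoted in this section can be cross-checked with a short script. It is a sketch under stated assumptions: the names `w_max` and `exec_cycles` are hypothetical, the area is taken as an integer in the synthesis tool's units, and the coefficients 55.52 and 8.32 are scaled by 100 so that the floor in Equation (7) is computed exactly.

```python
def w_max(area, n):
    """Equation (7): largest word size that fits n stages into the given area."""
    # floor(area / (55.52 n - 8.32)), with coefficients scaled by 100
    return (area * 100) // (5552 * n - 832)

def exec_cycles(m, area, n):
    """Equation (8): clock cycles for m-bit operands, given area and n stages."""
    w = w_max(area, n)
    e = -(-(m + 1) // w)                 # ceil((m + 1) / w)
    pipeline_cycles = -(-(m + 1) // n)   # ceil((m + 1) / n)
    return pipeline_cycles * (e + 1) - 1 + 2 * (n - 1)

if __name__ == "__main__":
    area = 20000
    print([w_max(area, n) for n in range(1, 11)])   # word sizes of Table 1
    cycles = exec_cycles(1024, area, 10)
    print(cycles, cycles * 11)                      # clock cycles and nanoseconds at 11 ns
```

With A = 20,000 the script reproduces the word sizes of Table 1 and the 3107-cycle (34,177 ns) estimate for m = 1024 and n = 10.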

8 Conclusions 

We presented a new architecture for implementing the Montgomery multiplica- 
tion. The fundamental difference of our design from other designs described in 
the literature is that it is scalable to any operand size, and it can be adjusted to 
any available chip area. The proposed architecture is highly flexible, and enables 
the investigation of several design tradeoffs involved in the computation of the 
Montgomery multiplication. Our analysis shows that a pipeline of several units 
is more adequate than a single unit working with a large word length. This is an 
interesting result since using more units we can reduce the word size and conse- 
quently the data paths in the final circuit, reducing the required bandwidth. The 
proposed data path for the multiplier was synthesized to a circuit that is able 
to work with clock frequencies up to 90MHz (for the CMOS technology consi- 
dered in this work). The total time to compute the Montgomery multiplication 
for a given precision of the operands will depend on the available area and the 
chosen pipeline configuration. The upper limit on the precision of the operands 
is dictated by the memory available to store the operands and internal results. 

Abstract. This paper investigates the hardware implementation of 
arithmetical operations (multiplication and inversion) in symmetric and 
alternating groups, as well as in binary permutation groups (permutation 
groups whose order is a power of 2). Various fast and space-efficient hardware
architectures will be presented. High speed is achieved by employing switching 
networks, which effect multiplication in one clock cycle (full parallelism). 
Space-efficiency is achieved by choosing, on one hand, proper network ar- 
chitectures and, on the other hand, the proper representation of the group 
elements. We introduce a non-redundant representation of the elements 
of binary groups, the so-called compact representation, which allows
low-cost realization of arithmetic for binary groups of large degrees such as 
128 or even 256. We present highly optimized multiplier architectures 
operating directly on the compact form of permutations. Finally, we give 
complexity and performance estimations for the presented architectures. 

Keywords: permutation multiplier, switching network, destination-tag 
routing, sorting network, separation network, binary group, compact re- 
presentation, secret-key cryptosystem, PGM. 



1 Introduction 

Several cryptosystems, such as RSA, elliptic curve systems, IDEA or SAFER, utilize 
operations in algebraic domains like polynomial rings or Galois-fields. Efficient 
implementations of the basic arithmetical operations in those domains have been 
extensively studied, but not much attention has been paid to simpler constructs 
like permutation groups. Our research on permutation group arithmetic has 
been motivated by the implementation of a secret-key cryptosystem called PGM 
(Permutation Group Mapping) [11,12], which utilizes some generator sets, called 
group bases, for encryption. 

Briefly, a basis for a permutation group G is an ordered collection 
$\beta = (B_0, B_1, \ldots, B_{w-1})$ of ordered subsets (so-called blocks) $B_i = (b_{i,0}, \ldots, b_{i,r_i-1})$ of 
G, such that each element $g \in G$ has a unique representation in form of a product 
$g = b_{0,x_0} \cdot b_{1,x_1} \cdots b_{w-1,x_{w-1}}$, where $b_{i,x_i} \in B_i$ and $0 \le x_i \le r_i - 1$. Thus, $\beta$ 
defines a bijective mapping $\beta : G \to X$ which assigns to each element $g \in G$ a 
unique vector $x = (x_0, \ldots, x_{w-1}) \in X$, where $X = \mathbb{Z}_{r_0} \times \mathbb{Z}_{r_1} \times \cdots \times \mathbb{Z}_{r_{w-1}}$. 
Clearly, $|G| = |X| = r_0 r_1 \cdots r_{w-1}$. The inverse mapping $\beta^{-1}$ can be effected by means of
permutation multiplications, whereas $\beta$ involves finding the proper factors in $\beta$, and is 
hence called factorization. There is a huge number of so-called transversal bases, 
for which factorization can be effected very efficiently by means of permutation 
multiplications and inversions. A pair of such randomly chosen bases $\beta_1$ and $\beta_2$ 
for some group G (called the carrier group) forms the key for PGM. The
encryption of a cleartext message m is $c = \beta_2(\beta_1^{-1}(m))$. Decryption is performed in 
the same manner by exchanging the roles of $\beta_1$ and $\beta_2$, i.e. $m = \beta_1(\beta_2^{-1}(c))$. 
To accommodate the cryptosystem to some binary cleartext and ciphertext space 
$\mathcal{M} = \mathcal{C}$, an additional, fixed mapping $\lambda : \mathcal{M} \to X$ has to be effected 
prior to and after the actual encryption. 

It is very natural to represent permutations in a computer in the so-called 
Cartesian form. Section 2 introduces the basic principle of multiplying two
Cartesian permutations in a switching network. Section 3 presents different, mostly 
novel multiplier architectures operating in the symmetric group. Unfortunately, 
a symmetric carrier group $S_n$ has a serious drawback: any basis 
for $S_n$ (n > 2) has several blocks whose length $r_i$ is not a power of 2. It follows that $\lambda$, which 
can be seen as a conversion from a binary to a mixed radix $r = (r_0, r_1, \ldots, r_{w-1})$, 
is computationally rather intensive [14]. 

As opposed to a symmetric group, any basis for a binary group (a permutation 
group whose order is a power of 2) has block lengths $r_i$ that are powers of 2, and thus the mapping $\lambda$ for such a 
carrier group is trivial. Since a binary group of degree n is only a small subgroup 
of the symmetric group $S_n$, the use of some large degree (n = 128 . . . 256) is
indicated, which makes the use of the multipliers designed for symmetric 
groups infeasible. The problem lies in the fact that the Cartesian
representation of binary group elements contains a large amount of redundancy. In Sec. 4 
we introduce a novel, non-redundant representation, the so-called compact
representation, and present various multiplier architectures operating directly on 
the compact form of permutations. Finally, Sec. 5 gives complexity and
performance estimations for the presented architectures. It turns out that a PGM 
system with a binary carrier group is indeed much more efficient than one based 
on a symmetric group of similar order. 

2 Multiplication in Permutation Networks 

To briefly recall, a permutation p of degree n is a bijection $p : L \to L$, where L 
is a set of n arbitrary symbols or points. In the arithmetic we propose, elements 
of the symmetric group $S_n$ are represented in the so-called Cartesian form. In 
this representation, $L = \{0, 1, \ldots, n-1\}$ and a permutation $p \in S_n$ is a vector 
$p = (p(0), p(1), \ldots, p(n-1))$ of the n function values. Suppose now that the elements of 
vector p are physically stored in their natural order in a block P of registers, i.e. 
$P[i] = p(i)$ for $0 \le i \le n-1$, where $P[i]$ denotes the content of the i-th register. 

By definition, the product of two permutations a and b is the permutation 
$q = a \cdot b$ with $q(i) = b(a(i))$ for $0 \le i \le n-1$. Using the Cartesian representation, 
the product q can be computed by means of n memory transfer operations: 
$Q[i] := B[A[i]]$ for $0 \le i \le n-1$, where A, B and Q are the memory blocks 
storing a, b and q, respectively. For simplicity of the notation, we are not going 
to distinguish register blocks from their content, but simply write $Q = A \cdot B$ to 
denote the product. 

By definition, the inverse of a permutation a is the permutation $q = a^{-1}$, for 
which $a \cdot a^{-1} = a^{-1} \cdot a = \iota$, where $\iota$ denotes the identity permutation (i.e. 
$\iota(i) = i$). The inverse $a^{-1}$ can be obtained in block Q by applying n memory 
transfers $Q[A[i]] := i$. We denote the inverse simply as $Q = A^{-1}$. 

The memory transfers can be carried out either sequentially or in parallel. 
The former is the typical software implementation. The parallel implementation 
exploits the fact that the n memory transfers are completely independent and 
can thus be carried out simultaneously in switching networks, as follows. For 
multiplication, the A[i]-th register of source block B is connected to the i-th register 
of the destination block Q, i.e. A is interpreted during routing as a vector of 
source addresses. After setting up the network, the content of B is copied to Q 
via the established connections, forming the product $A \cdot B$ in Q. Fig. 1a illustrates 
this principle on a small example. 




Fig. 1. a, Multiplication and b, inversion in a switching network 



When $A$ is interpreted as a vector of destination addresses, the reverse connections
are established. By then copying the content of $B$ to $Q$, the product $A^{-1} \cdot B$
is obtained in $Q$. By substituting $B = \iota$, the network delivers the inverse of $A$,
as shown in Fig. 1b.
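The effect of interpreting $A$ as destination addresses can be checked with a small Python sketch. It models only the resulting data movement, not the network itself; the function name `route_by_destination` and the sample data are assumptions made for illustration.

```python
def route_by_destination(a, b):
    """Copy B[i] to Q[A[i]] for every i; the content of Q is then a^-1 . b."""
    q = [None] * len(a)
    for i, destination in enumerate(a):
        q[destination] = b[i]
    return q


a = [2, 0, 3, 1]
b = [3, 2, 1, 0]
identity = list(range(4))

q = route_by_destination(a, b)                   # a^-1 . b
a_inverse = route_by_destination(a, identity)    # substituting B = identity
```

Because every destination address occurs exactly once in a permutation, no two transfers write to the same register, so all of them can proceed in parallel.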

3 Arithmetic in Symmetric Groups 

According to the computing principles introduced above, a multiplier network 
for Sn should be an n-input, n-output network (briefly (n,n) network). For the 
sake of full parallelism, the network must be able to connect the n inputs to 
the n outputs simultaneously, that is without collision (or blocking) at any of 
the links. Since the network has to be completely re-routed after each multi- 
plication, rearrangeable networks are favourable compared to the more complex 
non-blocking networks [3]. Moreover, since the routing operand may come from
the entire symmetric group $S_n$, the network must be able to realize all possible
$n$-to-$n$ connections. Exactly these characteristics describe a specific class of
switching networks, called permutation networks.

3.1 Crossbar Networks

The crossbar network is the most fundamental single stage permutation network 
[3]. As depicted in Fig. 2, it consists of n x n switches in a matrix form, which 
pass the signals from input port C to output port Q. In the following, we propose 
three different schemes for fast routing of the network. 

In the first routing scheme the pure crossbar network is equipped with a
routing port $A$ and with corresponding horizontal lines, each controlling an
$n$-to-1 multiplexer (MUX), as shown in Fig. 2a. According to the control signal, each
MUX selects one of the $n$ signals of port $C$ and forwards it to port $Q$. Note
that in this mechanism the data items entered at port $A$ are used as source addresses.
Hence, when Cartesian permutations are entered at $A$ and $C$, the network computes
the product $Q = A \cdot C$.




Fig. 2. a, MUX-type multiplier b, DMUX-type multiplier 



The second routing scheme in Fig. 2b adds routing port $B$ and corresponding
vertical lines to the pure crossbar network. Each of these lines controls a 1-to-$n$
demultiplexer (DMUX) which transmits the input signal through one of the $n$
output lines towards $Q$, while disconnecting it from all other output lines. Note
that the data items of $B$ are interpreted in this mechanism as destination addresses.
Accordingly, when Cartesian permutations are entered at $B$ and $C$, the network
computes the product $Q = B^{-1} \cdot C$.

A combination of the above two routing schemes yields the third one of Fig. 3. 
Both routing ports A and B are included here, and are connected to horizon- 
tal and, respectively, vertical addressing lines. In addition, each switching cell 
is equipped with an equivalence comparator, which compares the addresses 
received from the neighboring addressing lines. If the addresses are equal, the 
comparator closes the switch, otherwise opens it. By entering Cartesian permu- 
tations at ports A, B and C respectively, the result obtained at port Q is the 
product $Q = A \cdot B^{-1} \cdot C$. A more detailed description of this architecture and of
a bit-parallel realization has been published in [13]. 
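A behavioural sketch of the 3-operand crossbar may help to see why it yields $A \cdot B^{-1} \cdot C$. The Python model below is an assumption-laden illustration (the function name `crossbar_3op` is ours): every cell at row $i$ and column $j$ compares the two addresses it sees and, if they agree, connects input $C[j]$ to output $Q[i]$.

```python
def crossbar_3op(a, b, c):
    """Model of the comparator crossbar: Q[i] = C[j] for the j with B[j] == A[i]."""
    n = len(a)
    q = [None] * n
    for i in range(n):            # horizontal addressing line i carries A[i]
        for j in range(n):        # vertical addressing line j carries B[j]
            if a[i] == b[j]:      # the equivalence comparator closes this switch
                q[i] = c[j]
    return q


a = [2, 0, 3, 1]
b = [1, 3, 0, 2]
c = [3, 2, 1, 0]
identity = list(range(4))

q = crossbar_3op(a, b, c)                  # A . B^-1 . C
q_mux = crossbar_3op(a, identity, c)       # reduces to the MUX-type product A . C
q_dmux = crossbar_3op(identity, b, c)      # reduces to the DMUX-type product B^-1 . C
```

Setting one routing port to the identity recovers the two simpler schemes, which is consistent with the third scheme being described as their combination.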

Fig. 3. A 3-operand crossbar multiplier



3.2 Sorting Networks 

The Beneš network [3,5] is known to be the most efficient rearrangeable multistage
network topology based on elementary (2,2) switching cells. However, its
routing algorithm, the so-called looping algorithm [5], is intrinsically sequential
and can only be carried out in a centralized control unit. Accordingly, the mechanism
is rather slow and thus not suitable for a multiplier network. On the other hand,
there exists a large class of multistage networks, the so-called digit-controlled
(or delta) networks, which possess a very convenient routing algorithm, the
so-called destination-tag routing (or self-routing) [5,7,8,10]. This distributed control
mechanism is very fast, since the individual cells decide independently and
simultaneously. Unfortunately, delta networks are blocking.

A sorting network is an (n, n) multistage network effecting some deterministic 
sorting algorithm [2, 4, 6, 9]. The network is built from elementary (2,2) compare- 
exchange modules. Each module compares the two incoming numbers and routes 
them according to their magnitudes. No matter in which order input numbers 
are entered at the input, the network applies the proper permutation to them 
and delivers the sorted sequence. Hence, any sorting network can be regarded as 
a rearrangeable permutation network. 

Though all known sorting networks are more complex than the Beneš network,
they offer a way for destination-tag routing, as follows. If the elements
$a(i)$ of a Cartesian permutation $a$ are entered at inputs $A[i]$ ($0 \le i \le n-1$), the network
forwards each $a(i)$ to output $Q[a(i)]$. Put another way, vector $a$ carries routing
information and designates $n$ parallel paths from $A[i]$ to $Q[a(i)]$. When the
elements of a permutation $b$ are now attached to $a$, i.e. packets of the form
$(a(i), b(i))$ are entered at input line $i$, destination tag $a(i)$ will route $b(i)$ through the
network towards $Q[a(i)]$, i.e. eventually $Q[a(i)] = b(i)$ is obtained. It is seen that the
Cartesian permutation $q$ obtained at $Q$ satisfies $q(a(i)) = b(i)$, or equivalently, $a \cdot q = b$,
and thus $q = a^{-1} \cdot b$, which amounts to the general multiplication principle
of Sec. 2. Figure 4 illustrates the method on a small example.

Fig. 4. Multiplication in a sorting network
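Destination-tag routing through a sorting network can be mimicked in software by sorting packets on their tags. In the sketch below, Python's built-in `sorted` merely stands in for the hardware sorting network; the function name `multiply_by_sorting` and the sample operands are illustrative assumptions.

```python
def multiply_by_sorting(a, b):
    """Sort packets (a(i), b(i)) by tag a(i); the payloads then spell out a^-1 . b."""
    packets = sorted(zip(a, b))          # tags are unique, so the order is determined by a
    return [payload for _tag, payload in packets]


a = [2, 0, 3, 1]
b = [3, 2, 1, 0]
q = multiply_by_sorting(a, b)
```

After sorting, the packet with tag $a(i)$ sits at position $a(i)$, so its payload $b(i)$ ends up in $Q[a(i)]$, exactly as in the argument above.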



In Fig. 5 we introduce two classical sorting networks. The odd-even transposition
sorter is the parallel implementation of the insertion sort and, at the
same time, of the selection sort algorithms. The $n$ input numbers are sorted in $n$
stages, comprising $n(n-1)/2$ modules. Accordingly we say that the network has
depth $n$ and complexity $n(n-1)/2$. The arrows in the symbols of the compare-exchange
modules indicate the direction in which the larger numbers are forwarded.
Note that this network has a completely "straight" wiring topology, which
is advantageous in view of wiring area. The bitonic sorter as well as Batcher's
odd-even sorter [9] are known to be the most efficient regular topologies, having
$O(\log^2 n)$ stages in a recursive structure. Note that many lines cross between
certain stages, which is in direct correspondence with the wiring area.





Fig. 5. a, The odd-even transposition and b, the bitonic sorters 
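The depth and module count quoted for the odd-even transposition sorter can be reproduced with a short simulation. This is a minimal Python sketch under the assumption that even-numbered stages pair lines (0,1), (2,3), ... and odd-numbered stages pair lines (1,2), (3,4), ...; the function name is ours.

```python
def odd_even_transposition_sort(values):
    """Simulate the n-stage network; return the sorted list and the module count."""
    v = list(values)
    n = len(v)
    modules = 0
    for stage in range(n):
        first = stage % 2                     # alternate between even and odd pairings
        for i in range(first, n - 1, 2):
            modules += 1                      # one compare-exchange module per pair
            if v[i] > v[i + 1]:
                v[i], v[i + 1] = v[i + 1], v[i]
    return v, modules


sorted_values, module_count = odd_even_transposition_sort([5, 2, 7, 0, 3, 6, 1, 4])
```

All modules only ever connect neighbouring lines, which is the "straight" wiring property mentioned above.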



In a straightforward realization of the compare-exchange modules, comparison
is carried out first, and the result is then used to set the switches. In this
method, no data can be transferred until the comparison is completed. Considerable
acceleration can be achieved by recognizing that comparison can be
performed sequentially, scanning from the MSBs towards the LSBs of the input
numbers $X$ and $Y$, according to the following algorithm:

1. As long as $X_i = Y_i$ while scanning bits in decreasing order of $i$, it does not
matter how the switch is set, and thus $X_i$ and $Y_i$ can be passed to the next
stage;

2. as soon as a difference is noticed at bit $j$, i.e. $X_j \ne Y_j$, all switches for bits
$i \le j$ can be set to the same state, which is determined by the relation of
$X_j$ and $Y_j$.

In the improved scheme, corresponding bits $X_i$ and $Y_i$ are transmitted to the
next stage immediately after their comparison, that is before the comparison 
of lower bits is completed. The higher order bits reach the next stage therefore 
earlier than the lower order ones, where their comparison starts immediately. In 
this way, each comparator stage delays the destination tag effectively only by 
the time of a single bit-comparison. For implementation details we refer to [14]. 
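The bit-serial compare-exchange idea can be sketched as a small state machine that emits one output bit pair per input bit pair. The Python below is a behavioural illustration only (the names `compare_exchange_bitserial`, `to_bits` and `from_bits` are our own); the actual circuit is described in [14].

```python
def to_bits(value, width):
    """MSB-first list of bits of a non-negative integer."""
    return [(value >> k) & 1 for k in range(width - 1, -1, -1)]


def from_bits(bits):
    """Inverse of to_bits."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def compare_exchange_bitserial(x_bits, y_bits):
    """Forward bits MSB-first; the switch is fixed at the first differing bit."""
    swap = None                       # None: still undecided, bits seen so far were equal
    high, low = [], []
    for xb, yb in zip(x_bits, y_bits):
        if swap is None and xb != yb:
            swap = yb > xb            # Y is the larger number, so route it to "high"
        if swap:
            high.append(yb)
            low.append(xb)
        else:                         # undecided (equal bits) or X is the larger number
            high.append(xb)
            low.append(yb)
    return high, low


h, l = compare_exchange_bitserial(to_bits(9, 4), to_bits(12, 4))
larger, smaller = from_bits(h), from_bits(l)
```

Each output bit depends only on the bits seen so far, so a following stage can start working before the lower-order bits have arrived.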

3.3 Separation Networks 

Sorting networks are able to sort arbitrary number sequences. Note however 
that Cartesian permutations are special sequences, such that each number of the 
range 0 ... n — 1 occurs exactly once. This kind of sequence we call a permutation 
sequence. In the following, we introduce a class of novel permutation network 
architectures, which exploit this property to reduce hardware complexity. The 
new networks employ the radix sorting algorithm [1] for routing: destination 
addresses are represented as binary strings, starting with the MSB as first letter. 
Sorting proceeds as follows: first the strings starting with a '1' as first letter are
separated from those starting with a '0'. As a second step, both of the resulting
subsequences are further split up so that strings having '1' as second letter get
separated from those having '0' at the same position. The 'divide-and-conquer'
principle is followed in this way until the last step, where strings with trailing '1'
are separated from those with trailing '0'.
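The MSB-first radix sorting procedure described above can be written down directly. The following Python sketch is an illustration under our own naming (`separate`, `radix_route`); it works on integer destination tags rather than on hardware lines.

```python
def separate(tags, bit):
    """Split tags into those with a 0 and those with a 1 at the given bit position."""
    zeros = [t for t in tags if not (t >> bit) & 1]
    ones = [t for t in tags if (t >> bit) & 1]
    return zeros, ones


def radix_route(tags, width):
    """Sort by repeatedly separating every group on the next lower bit, MSB first."""
    groups = [list(tags)]
    for bit in range(width - 1, -1, -1):
        next_groups = []
        for group in groups:
            zeros, ones = separate(group, bit)
            next_groups += [zeros, ones]
        groups = next_groups
    return [t for group in groups for t in group]


tags = [5, 2, 7, 0, 3, 6, 1, 4]          # a permutation sequence of length 8
zeros, ones = separate(tags, 2)           # first separation step, on the MSB
routed = radix_route(tags, 3)
```

For a permutation sequence whose length is a power of two, every separation step yields two halves of equal length, which is what allows each step to be wired as a fixed-size separator stage.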

Since a permutation sequence contains a predetermined set of strings, the
number of strings with '1' and '0', respectively, at any particular position is
constant, irrespective of the actual sequence. Due to this fact, the length of the
separated subsequences is known and constant for all separation steps. In the
specific case of $n = 2^s$, all separated subsequences are balanced, i.e. contain
exactly as many 1's as 0's at any particular position. This property is the basis
for the design of separator networks. Each separation step is effected in a dedicated
separator stage. The first separator stage splits the input sequence in two
halves of length $n/2$ (without actually achieving perfect ordering), the next stage
produces subsequences of length $n/4$, and so on. Networks of degree $n \ne 2^s$ can
be constructed by omitting parts of a network of degree $2^s$, where $n < 2^s$.

The strength of the technique lies in the fact that any particular stage can 
achieve the separation by looking at corresponding single bits of the destina- 
tion tags. The method can thus be considered as the generalization of the bit- 
controlled self-routing algorithm for permutation networks. Interestingly, com- 
paring corresponding bits X and Y of two destination tags and routing them 
towards the proper output H (“higher” value) and respectively L (“lower” va- 
lue) requires no logic at all. To see this, consider the truth-table of the “binary” 
compare-exchange module: 



X   Y   switch state required   switch state chosen   H   L
0   0   don't care              across                0   0
0   1   across                  across                1   0
1   0   straight                straight              1   0
1   1   don't care              straight              1   1

By choosing the switch state for the don't cares as shown above, the switches
can be controlled directly by input bit $X$, whereas $H$ and $L$ can be formed by a
single OR gate and a single AND gate, respectively. See [14] for implementation details.
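The claim about the "binary" compare-exchange module is easy to verify exhaustively. The Python sketch below is illustrative (the function name is ours): it sets the switch from $X$ alone and lets the test confirm that the outputs coincide with an OR and an AND of the inputs.

```python
def binary_compare_exchange(x, y):
    """Set the switch from X alone and return (state, H, L) for one-bit inputs."""
    state = "straight" if x else "across"
    if state == "straight":
        high, low = x, y
    else:
        high, low = y, x
    return state, high, low


table = {(x, y): binary_compare_exchange(x, y) for x in (0, 1) for y in (0, 1)}
```

Since the switch state depends on $X$ only, no comparison logic is needed to control the data path of a separator stage.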

In the following, we present a couple of novel separator network architectures.
The first scheme is related to the bitonic sorter (Fig. 5b). The two bitonic sorters
of length $n/2$ and a half-cleaner stage of this network form a so-called selection
network [9], which separates the $n/2$ largest from the $n/2$ smallest elements. If
a sequence of $n/2$ 1's and $n/2$ 0's is entered, the selection network separates the
1's from the 0's. Clearly, this is also achieved when the magnitude comparator
modules are replaced with "binary" comparator modules. Such bitonic separator
stages can be used to build a bitonic separator network, as illustrated in Fig. 6
for $n = 8$. The network has depth of order $O(\log^2 n)$.




Fig. 6. A separator network based on bitonic separators 



Note that the (n/2,n/2)-sorters in front of the half-cleaner can actually be 
replaced by any kind of sorting network, for instance, by odd-even transposition 
sorters. We call the network obtained in this way the linear odd-even separator 
network. As the name suggests, it has depth of order $O(n)$.

Separator stages can rely on other principles, too. The sorter depicted in
Fig. 7 employs a novel separator type of depth $O(n)$, which we call a diamond
separator. The underlying sorting principle is similar to that of the odd-even
transposition sorter. The advantage of this architecture is the completely straight
wiring pattern. Its drawbacks are that the network is rather deep and hard to
lay out in a rectangular form.

The rotation separator offers lower depth, rectangular layout and still a
"nearly" straight wiring topology. The separation principle can be followed in
Fig. 8. Links running across the network are considered to be of two types:
0-lines, which are expected to deliver 0's at the output, and 1-lines, which should
deliver 1's. A '1' on a 0-line (and a '0' on a 1-line, respectively) is considered
a 1-error (a 0-error, respectively). Due to the balance in a permutation sequence
of length $n = 2^s$, 1-errors are present in the same number as 0-errors
at any particular stage. Each compare-exchange module receives input from a
0-line and a 1-line, and outputs to a 0-line and a 1-line. When a 1-error and
a 0-error are received, they "neutralize" each other, i.e. both errors disappear.

Fig. 7. The "diamond" separator network with 8 inputs (outputs Q0 to Q7)



The topology of the network implements the following strategy for eliminating 
all errors in the input sequence: 0-lines (carrying potentially 1-errors) are iterati- 
vely “rotated around” and combined pairwise with 1-lines (carrying potentially 
0-errors). In order for all 0-lines to be combined with all 1-lines, n/2 rotation 
steps are needed, and hence the separator stage has depth $n/2$. The total depth
of the entire network is $n/2 + n/4 + \cdots + 1 = n - 1$.




Fig. 8. A "rotation" separator stage with 8 inputs (legend: 1-lines, 0-lines, 1-errors, 0-errors, and modules at which two errors meet)



4 Arithmetic in Binary Groups 

As mentioned, any binary group of degree $n$ is a subgroup of $S_n$. Unfortunately,
even a so-called Sylow-2 subgroup $H_s$ of $S_n$, which is of maximal order, is rather
small; it has order $|H_s| = 2^{n-1}$ if $n = 2^s$. Hence, if a certain group size is
required, the use of a binary group of some large degree is indicated. For instance,
if a group order of at least $2^{127}$ is required, not unusual in cryptographic
applications, either a symmetric group of degree $n = 34$ or a Sylow-2 subgroup of
degree $n = 128$ may be chosen. Unfortunately, the storage of Cartesian permutations
($7 \times 128 = 896$ bits in the above example) as well as the multipliers based
on permutation networks are very extensive for binary groups of such large
degrees. Note however that the Cartesian form is very redundant for representing
binary group elements, and that the multiplier networks would be used rather
inefficiently too, because most of the possible permutation patterns, namely those
in $S_n$ but not in $H_s$, would never be configured.
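The sizes quoted in this comparison follow from elementary arithmetic, which the Python sketch below reproduces. The helper names are our own, and the formula $2^{n-1}$ for the Sylow-2 subgroup assumes that $n$ is a power of two, as in the text.

```python
import math


def log2_order_symmetric(n):
    """log2 of |S_n| = n!."""
    return math.log2(math.factorial(n))


def log2_order_sylow2(n):
    """log2 of the order 2^(n-1) of a Sylow-2 subgroup of S_n, n a power of two."""
    return n - 1


def cartesian_bits(n):
    """Storage for one Cartesian permutation: n entries of ceil(log2 n) bits each."""
    return n * (n - 1).bit_length()


smallest_degree = next(n for n in range(2, 200) if log2_order_symmetric(n) >= 127)
bits_s34 = cartesian_bits(34)
bits_h7 = cartesian_bits(128)
```

The same group order thus costs considerably more storage in Cartesian form when a binary group is used, which motivates the compact representation introduced later in this section.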

A study of the indirect binary cube (IBC) network for $n = 2^s$ has shown that
though the set of permutations realized by the network is not a group, it embeds
a Sylow-2 subgroup $H_s$ of $S_n$. Similar results can be obtained for the "inverse"
of the IBC network, the so-called generalized cube (also called butterfly or
SW-Banyan) network, as well as for other $(n, n)$ delta networks, such as the omega,
the baseline, the modified data manipulator (MDM) and their respective "inverses",
the reverse omega (also called flip), the reverse baseline and the inverse MDM
networks [5,10]. The different delta networks realize various instances of $H_s$,
while it is known that all Sylow-2 subgroups $H_s$ of $S_n$ are isomorphic.

A delta network of degree $n = 2^s$ comprises $s \cdot n/2$ switches in $s$ stages, and
is thus considerably more efficient for $H_s$ than any of the permutation networks.
We illustrate the construction of a multiplier on the IBC network of degree $n = 8$,
depicted on the left of Fig. 9. The network contains 12 binary switches and, since
it is a banyan network (i.e. there is one unique path from each input to each
output), all of the $2^{12}$ different configurations realize different permutations. This
permutation set of size $2^{12}$ is not a group, but it contains $H_3$, which is of order
$2^7$. Clearly, some configurations will never be used if working in $H_3$.




Fig. 9. The indirect binary cube network for n = 8 in two presentations 



As illustrated in the figure, the IBC network can be configured by the
bit-controlled self-routing algorithm, where switches in the first stage are controlled
by the LSBs, and subsequent stages by succeeding bits of the destination tags. If
the network is first routed with a Cartesian permutation $a \in H_s$ and another
permutation $b \in H_s$ is then transferred, the network delivers, according to the general
multiplication principle of Sec. 2, the product $q = a^{-1} \cdot b$.

It turns out that if working in $H_s$, switches of certain switch groups are
always set to a common state, and can thus be unified in one switching module,
as depicted on the right side of the figure. The unified switches can be controlled
by one common signal, which sets either a "straight" or a "swapping" connection
pattern. We call an IBC network with unified switches a UIBC network. The
$2^{n-1} = 2^7$ different connection patterns of the UIBC network realize exactly the
elements of (a specific instance of) $H_3$.

The control bits can be extracted from the Cartesian permutation $a$ by simply
selecting certain bits of $a$. Actually, the $n - 1 = 7$ control bits can be seen as
a special representation of the group elements of $H_3$, which we call the compact
representation. From the fact that $|H_s| = 2^{n-1}$ it is seen that the compact
representation is non-redundant and hence optimal. Expanding the compact form to
the Cartesian form is similarly simple; it can be achieved by reproducing (copying)
certain bits of the compact permutation. Note that the ease of the conversions
is not a general feature but specific to the instance of $H_s$ induced by the use of
the UIBC network.
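To make the idea of an $(n-1)$-bit representation concrete, the following Python sketch uses the textbook description of a Sylow-2 subgroup of $S_{2^s}$ as the automorphism group of a complete binary tree: one control bit per inner tree node decides whether the two halves below that node are swapped. This is an assumption made for illustration; it is not claimed to be the instance of $H_s$ induced by the UIBC network, nor the bit ordering used in [14].

```python
from itertools import product


def expand(bits, n):
    """Expand n-1 control bits (heap order, root first) into a Cartesian permutation."""
    def subtree(node, lo, size):
        if size == 1:
            return [lo]
        half = size // 2
        left = subtree(2 * node, lo, half)
        right = subtree(2 * node + 1, lo + half, half)
        # A set control bit swaps the two halves below this tree node.
        return right + left if bits[node - 1] else left + right
    return subtree(1, 0, n)


def multiply(a, b):
    """Cartesian product q(i) = b(a(i))."""
    return [b[x] for x in a]


n = 8
group = {tuple(expand(bits, n)) for bits in product((0, 1), repeat=n - 1)}
```

For $n = 8$ the seven control bits produce 128 distinct permutations that are closed under multiplication, matching the order $2^{n-1}$ stated above.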

A great advantage of the compact representation is that it allows space-efficient
storage of elements of $H_s$. Note furthermore that since the Cartesian
form of $b \in H_s$ is redundant, more bits than actually necessary are transferred
by the UIBC network while multiplying. By removing from the network those links
and switching components which convey redundant bits of $b$, the complexity of the
scheme can be significantly reduced. The optimized scheme transmits merely the
compact form of $b$. The resulting multiplier network, called MULAIB, is shown in
Fig. 10 for $n = 8$.




Fig. 10. Different multipliers and inverters working on compact permutations 



Figure 10 illustrates further multiplier and inverter architectures deduced 
from the IBC network and respectively, from its inverse, the generalized cube 
network. All architectures work directly on the compact form of operands. 

Above we followed an illustrative approach to introduce the arithmetic for
binary groups. An accurate description of the construction of the group $H_s$
underlying the arithmetic, a formal definition of the compact representation, proofs
of the multiplication algorithms, further multiplier and inverter schemes as well
as a generalization of the theory to a large class of binary groups of arbitrary
degree $n$ have been omitted here for lack of space, but can be found in [14].

5 Conclusions

In the following, we give complexity and performance estimations for the presented
multiplier architectures. The examined multipliers operate in the symmetric
group $S_{32}$ and in the binary group $H_7$ (degree $n = 128$), which have comparable
orders: $|S_{32}| \approx 2^{118}$ and $|H_7| = 2^{127}$, respectively. Estimations of complexity
and delays have been made for the 0.7 µm ES2 standard-cell CMOS technology of
European Silicon Structures. The complexity of the typically extensive wiring of
switching networks, indicated also by the measure wiring width, has been taken
into account. The throughput of the networks has been calculated for a purely
combinational, full-parallel, non-pipelined implementation. The estimation
methodology as well as other implementation styles are detailed in [14]. Table 1
below summarizes the results.



Table 1. A comparison of multipliers for $S_{32}$ and $H_7$

                     |   Topology     |       Complexity        |      Performance
Multiplier           |        # sw.   | gate    wiring   area   | gate    perf.   perf./
design               | depth  modules | count   width    mm^2   | delay   MMPS    area
---------------------+----------------+-------------------------+----------------------
MUX-type crossbar    |   1     1024   | 8.68K    640     11.8   |   9     222     18.8
DMUX-type crossbar   |   1     1024   | 12.7K    640     15.8   |  12     167     10.5
3-operand crossbar   |   1     1024   | 25.0K    640     27.6   |  12     167     6.04
Linear oev. sorter   |  32      496   | 21.5K      0     16.3   |  72     27.8    1.70
Bitonic sorter       |  15      240   | 10.4K    820     14.4   |  38     52.6    3.65
Linear oev. sep.net  |  35      560   | 8.47K    526     10.2   |  71     28.2    2.76
Bitonic sep.net      |  25      400   | 5.95K   1030     10.7   |  51     39.2    3.67
Diamond sep.net      |  46      496   | 7.54K      0     5.73   |  93     21.5    3.75
Rotation sep.net     |  31      496   | 7.54K    526     9.26   |  63     31.7    3.43
MULAIB               |   7      756   | 1.61K    240     3.17   |  15     133     42.0
MULAB                |   7      756   | 1.61K    240     3.17   |  19     105     33.1
MULABI               |   7      756   | 1.61K    240     3.17   |  44     45.5    14.4
INV                  |   7      756   | 1.33K    240     2.82   |  13     154     54.6



Among the multipliers for $S_{32}$, the crossbar architectures are very fast and
cost-effective too. The bitonic separator network and the rotation separator
network perform quite similarly, and are slightly smaller and slower than the
well-known bitonic sorter. All multipliers for $H_7$ have extremely low gate complexity,
whereas about 60% of the total area is spent for global wiring in all designs.
The reason that MULAB performs significantly worse than MULAIB is that the
control signals that are to be distributed at a particular stage are produced by the
preceding stage. Therefore, the delay of signal distribution adds to the total
delay at each stage, a rather undesirable phenomenon.
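The last column of Table 1 appears to be the throughput divided by the area, which can be cross-checked with a few lines of Python. The sketch below copies a handful of rows from the table; the 1% tolerance is our own choice to allow for the rounding of the published figures.

```python
# (area in mm^2, throughput in MMPS, published perf./area) for selected designs
rows = {
    "MUX-type crossbar": (11.8, 222, 18.8),
    "Bitonic sorter": (14.4, 52.6, 3.65),
    "Rotation sep.net": (9.26, 31.7, 3.43),
    "MULAIB": (3.17, 133, 42.0),
    "INV": (2.82, 154, 54.6),
}


def perf_per_area(area_mm2, perf_mmps):
    """Throughput per unit area, in MMPS per mm^2."""
    return perf_mmps / area_mm2


relative_errors = {
    name: abs(perf_per_area(area, perf) - published) / published
    for name, (area, perf, published) in rows.items()
}
```

On this measure MULAIB is roughly twice as area-efficient as the best crossbar, in line with the comparison drawn in the text.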

To summarize, the multipliers for the binary groups outperform those for
the symmetric group, and because of their $O(n \log n)$ complexity the gain becomes
even more striking for larger groups. We stress here again that the very
fundamental invention which allows both space-efficient storage and efficient
computation in binary groups is that of the compact representation.

Abstract. A method is described for performing computations in a finite
field GF($2^N$) by embedding it in a larger ring $R_p$ where the multiplication
operation is a convolution product and the squaring operation is a
rearrangement of bits. Multiplication in $R_p$ has complexity $N + 1$, which
is approximately twice as efficient as optimal normal basis multiplication
(ONB) or Montgomery multiplication in GF($2^N$), while squaring
has approximately the same efficiency as ONB. Inversion and solution
of quadratic equations can also be performed at least as fast as previous
methods.



Introduction 

The use of finite fields in public key cryptography has blossomed in recent years.
Many methods of key exchange, encryption, signing and authentication use field
operations in either prime fields GF($p$) or in fields GF($2^N$) whose order is a
power of 2. The latter fields are especially pleasant for computer implementation
because their internal structure mirrors the binary structure of a computer.

For this reason there has been considerable research devoted to making the
basic field operations in GF($2^N$) (especially squaring, multiplication, and
inversion) efficient. Innovations include:

• use of optimal normal bases [15];

• use of standard bases with coefficients in a subfield GF($2^r$) [26];

• construction of special elements $\alpha \in$ GF($2^N$) such that powers can be
computed very rapidly [5,6,8];

• an analogue of Montgomery multiplication [14] for the fields GF($2^N$) [13].

The discrete logarithm problem in finite fields can be used directly for crypto- 
graphy, for example in Diffie-Hellman key exchange, ElGamal encryption, digital 
signatures, and pseudo-random number generation. (See [3,9,16,17,22] for a
discussion of the difficulty of solving the discrete logarithm problem in GF($2^N$).)
An alternative application of finite fields to cryptography, as independently sug- 
gested by Koblitz and Miller, uses elliptic curves. In this situation the finite 
fields are much smaller (fields of order 2^^^ and 2^^^ are suggested in [10]), but 
the field operations are used much more extensively. Various methods have been 
suggested to efficiently implement elliptic curve cryptography over GF($2^N$) in
hardware [1] and in software [11,23].



C.K. Koc and C. Paar (Eds.): CHES’99, LNCS 1717, pp. 122-134, 1999. 
(c) Springer- Verlag Berlin Heidelberg 1999 

The standard way to work with GF($2^N$) is to write its elements as polynomials
in GF(2)[$X$] modulo some irreducible polynomial $\Phi(X)$ of degree $N$.
Operations are performed modulo the polynomial $\Phi(X)$, that is, using division
by $\Phi(X)$ with remainder. This division is time-consuming, and much work has
been done to minimize its impact. Frequently one takes $\Phi(X)$ to be a trinomial,
that is, a polynomial $X^N + aX^k + b$ with only three terms, so as to simplify the
division process. See, for example, [23] or [10, §6.3, 6.4]. Montgomery multiplication
replaces division by an extra multiplication [13], although this also exacts
a cost.

A second way to work with GF($2^N$) is via normal bases, especially optimal
normal bases [15], often abbreviated ONB. Using ONB, elements of GF($2^N$) are
represented by exponential polynomials $a_0\beta + a_1\beta^2 + a_2\beta^4 + \cdots + a_{N-1}\beta^{2^{N-1}}$.
Squaring is then simply a shift operation, so is very fast, and with an "optimal"
choice of field, multiplication is computationally about the same as for a standard
representation. More precisely, the computational complexity of multiplication is
measured by the number of 1 bits in the multiplication transition matrix $(\lambda_{ij})$.
The minimal complexity possible for a normal basis is $2N - 1$, and optimal
normal bases are those for which the complexity is exactly $2N - 1$. (The ONB's
described here are so-called Type I ONB's; the Type II ONB's are similar, but
a little more complicated. Both types of ONB have complexity $2N - 1$.)

In this note we present a new way to represent certain finite fields GF($2^N$)
that allows field operations, especially multiplication, to be done more simply
and rapidly than either the standard representation or the normal basis
representation. We call this method GBB, which is an abbreviation for Ghost Bit Basis,
because as we will see, the method adds one extra bit to each field element. The
fields for which GBB works are the same as those for which Type I ONB works,
but the methods are quite different. Most importantly, the complexity of the
multiplication transition matrix for GBB is $N + 1$, so multiplication using GBB is
almost twice as fast (or, for hardware implementations, half as complex) as
multiplication using ONB. Further, squaring in GBB is a rearrangement of bits that
is different from the squaring rearrangement (cyclic shift) used by ONB. (We
refer the reader to [24] for a description of all fields having a GBB-multiplication.)

[Important Note. The GBB construction is originally due to Ito and Tsu- 
jii [28]. See the note “Added in Proof” at the end of this article.] 

1 Cyclotomic Rings over GF(2) 

We generate the field GF($2^N$) in the usual way as a quotient GF(2)[$X$]/($\Phi(X)$),
where we choose an irreducible cyclotomic polynomial of degree $N$,

$$\Phi(X) = X^N + X^{N-1} + \cdots + X^2 + X + 1.$$

As is well known, $\Phi(X)$ is irreducible in GF(2)[$X$] if and only if

(1) $p = N + 1$ is prime;

(2) 2 is a primitive root modulo $p$.

The second condition simply means that the powers $1, 2, 2^2, 2^3, \ldots, 2^{N-1}$ are
distinct modulo $p$, or equivalently, that

$$2^{N/\ell} \not\equiv 1 \pmod{p} \quad \text{for every prime } \ell \text{ dividing } N.$$

There are many $N$'s that satisfy these properties, including for example the
values $N$ = 148, 162, 180, 786, and 1018. (See Section 4 for a longer list.)
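The two conditions are straightforward to test by machine. The Python sketch below is a minimal illustration using trial division, which is adequate for the small values of $N$ considered here; the function names (`is_prime`, `prime_factors`, `admits_gbb`) are our own.

```python
def is_prime(m):
    """Trial-division primality test, sufficient for small m."""
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True


def prime_factors(m):
    """Set of primes dividing m."""
    factors = set()
    d = 2
    while d * d <= m:
        while m % d == 0:
            factors.add(d)
            m //= d
        d += 1
    if m > 1:
        factors.add(m)
    return factors


def admits_gbb(N):
    """True if p = N + 1 is prime and 2 is a primitive root modulo p."""
    p = N + 1
    if not is_prime(p):
        return False
    return all(pow(2, N // ell, p) != 1 for ell in prime_factors(N))


small_values = [N for N in range(2, 20) if admits_gbb(N)]
```

The same function can be applied to the larger values quoted above, for example `admits_gbb(148)`.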

Remark 1. As noted above, the primes satisfying conditions (1) and (2) are
exactly the primes for which there exists a Type I ONB. However, ONB's use
these properties to find a basis for GF($2^N$) of the form $\beta, \beta^2, \beta^4, \ldots, \beta^{2^{N-1}}$.
For GBB, we will simply be using the fact that $\Phi(X)$ is irreducible.

We now observe that the field GF($2^N$), when represented in the standard
way as the set of polynomials modulo $\Phi(X)$, sits naturally as a subring of the
ring of polynomials modulo $X^p - 1$. (Remember, $N = p - 1$.) In mathematical
terms, there is an isomorphism

$$\frac{\mathrm{GF}(2)[X]}{(X^p - 1)} \cong \mathrm{GF}(2^N) \times \mathrm{GF}(2).$$

This is an isomorphism of rings, not fields, but as we will see, the distinction
causes few problems.

For notational convenience, we let $R_p$ denote the ring of polynomials modulo
$X^p - 1$,

$$R_p = \frac{\mathrm{GF}(2)[X]}{(X^p - 1)}.$$

We interchangeably write polynomials as $a = a_N X^N + \cdots + a_1 X + a_0$ and as a
list of coefficients $a = [a_N, a_{N-1}, \ldots, a_0]$.

Remark 2. Our method works more generally for fields GF($q^N$) for any prime
power $q$ provided $p = N + 1$ is prime and $q$ is a primitive root modulo $p$. In
this setting, the ring GF($q$)[$X$]/($X^p - 1$) is isomorphic to GF($q^N$) × GF($q$). We
leave to the reader the small adaptations necessary for $q \ge 3$. For most computer
applications, $q = 2$ is the best choice, but depending on machine architecture
other values could be useful, especially $q = 2^r$ for $r \ge 2$.

We now briefly discuss the complexity of operations in $R_p$. More generally,
let $R$ be any ring that is a GF(2)-vector space of dimension $n$, so for example $R$
could be $R_p$ (with $n = p$), or $R$ could be a field GF($2^N$) (with $n = N$). Let
$B = \{\beta_0, \ldots, \beta_{n-1}\}$ be a basis for $R$ as a GF(2)-vector space. Then each product
$\beta_i\beta_j$ can be written as a linear combination of basis elements,

$$\beta_i\beta_j = \sum_{k=0}^{n-1} \lambda_{ij}^{(k)} \beta_k.$$

The complexity of multiplication relative to the basis $B$ is measured by the
number of $\lambda_{ij}^{(k)}$ that are equal to 1,

$$C(B) = \frac{1}{n}\,\#\{(i, j, k) : \lambda_{ij}^{(k)} = 1\}.$$

It is easy to see that $C(B) \ge 1$, and that if $R$ is a field, then $C(B) \ge n$. A
more interesting example is given by a normal basis for $R$ = GF($2^N$), in which
case it is known [15] that $C(B) \ge 2N - 1$. A normal basis for GF($2^N$) is called
optimal if its complexity equals $2N - 1$. A complete description of all fields that
possess an optimal normal basis is given in [7].

The complexity of the basis $B = \{1, X, \ldots, X^{p-1}\}$ for the ring $R_p$ is clearly
$C(B) = p$, since $\lambda_{ij}^{(k)} = 1$ if and only if $i + j \equiv k \pmod p$. In other words, for
each pair $(i, j)$ there is exactly one $k$ with $\lambda_{ij}^{(k)} = 1$. So taking $p = N + 1$ as
usual, we see that an optimal normal basis for $GF(2^N)$ has complexity $2N - 1$,
while the standard basis for $R_p$ has complexity $N + 1$, making multiplication in
$R_p$ approximately twice as fast (or half as complicated) as in $GF(2^N)$. It is thus
advantageous to perform $GF(2^N)$ multiplication by first moving to $R_p$ and then
doing the multiplication in $R_p$.
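The count behind $C(B) = p$ can be checked by brute force. The following Python sketch is an illustration only; the function name `complexity_standard_basis` is our own and does not come from the paper. It counts the triples $(i, j, k)$ with $\lambda_{ij}^{(k)} = 1$ for the standard basis of $R_p$ and divides by the dimension.

```python
def complexity_standard_basis(p):
    """Complexity C(B) of the basis 1, X, ..., X^(p-1) of GF(2)[X]/(X^p - 1).

    lambda_ij^(k) = 1 exactly when X^i * X^j = X^k, i.e. i + j = k (mod p).
    """
    count = 0
    for i in range(p):
        for j in range(p):
            for k in range(p):
                if (i + j) % p == k:
                    count += 1
    # Normalise by the dimension n = p of the ring as a GF(2)-vector space.
    return count // p


N = 10
p = N + 1
# Compare the standard basis of Rp with the optimal normal basis value 2N - 1.
print("standard basis of Rp:", complexity_standard_basis(p))
print("optimal normal basis of GF(2^N):", 2 * N - 1)
```

The triple loop is deliberately naive so that it mirrors the definition of $C(B)$ term by term.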

A second important property of a basis for finite field implementations,
especially in hardware, is a sort of symmetry whereby the multipliers $\lambda_{ij}^{(k)}$ are
determined by the multipliers $\lambda_{ij}^{(0)}$ by a simple transformation. We say that $B$
is a permutation basis if there are permutations $\sigma_k, \tau_k$ such that

$$\lambda_{ij}^{(k)} = \lambda_{\sigma_k(i)\,\tau_k(j)}^{(0)} \quad \text{for all } 0 \le i, j, k \le n - 1.$$

In practical terms, this means that the circuitry used to compute the first co- 
ordinate of a product ab can be used to compute all of the other coordinates 
simply by rearranging the order of the inputs. 

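To see why this is true, we write $a = \sum_i a_i \beta_i$ and $b = \sum_j b_j \beta_j$. Then (after a
little algebra) the product $ab$ is equal to

$$ab = \sum_{k=0}^{n-1} \Bigl( \sum_{i,j=0}^{n-1} a_i b_j \lambda_{ij}^{(k)} \Bigr) \beta_k .$$

If $B$ is a permutation basis, we can rewrite this as

$$ab = \sum_{k=0}^{n-1} \Bigl( \sum_{i,j=0}^{n-1} a_i b_j \lambda_{\sigma_k(i)\,\tau_k(j)}^{(0)} \Bigr) \beta_k
     = \sum_{k=0}^{n-1} \Bigl( \sum_{i,j=0}^{n-1} a_{\sigma_k^{-1}(i)}\, b_{\tau_k^{-1}(j)}\, \lambda_{ij}^{(0)} \Bigr) \beta_k .$$

Thus the $k$-th coordinate of $ab$ is computed by first using the permutations $\sigma_k$
and $\tau_k$ to rearrange the bits of $a$ and $b$ respectively, and then feeding the
rearranged bit strings into the circuit that computes the first coordinate of $ab$.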

It turns out that both ONB and GBB are permutation bases, but the
corresponding permutations are slightly different. For ONB one has the relation

$$\lambda_{ij}^{(k)} = \lambda_{i-k,\,j-k}^{(0)},$$

while for GBB (i.e., for the standard basis in $R_p$) the relation is

$$\lambda_{ij}^{(k)} = \lambda_{i-k,\,j}^{(0)},$$

where in both formulas we are taking the subscripts modulo $n$.
Thus the permutation for GBB is a little easier to implement than ONB because
only one of the inputs needs to be shifted.



2 Overview of Operations in Rp 

In this section we briefly describe some of the advantages of working in the 
ring Rp. In Section 3 we will discuss these in more detail. 



2.1 Moving between GF(2^N) and Rp

An element of $GF(2^N)$ is simply a list of $N$ bits, and similarly an element of $R_p$
is a list of $N + 1$ bits,

$$[a_{N-1}, \ldots, a_1, a_0] \in GF(2^N) \quad\text{and}\quad [a_N, a_{N-1}, \ldots, a_0] \in R_p .$$

We call the extra bit in $R_p$ the “ghost bit”. In order to do a computation in
$GF(2^N)$, we first move to $R_p$, next do all computations in $R_p$, and finally move
the final answer back to $GF(2^N)$. Movement between $GF(2^N)$ and $R_p$ is
extremely fast, at most a single complement operation.

More precisely, the map from $GF(2^N)$ to $R_p$ is given by

$$GF(2^N) \longrightarrow R_p, \qquad [a_{N-1}, \ldots, a_1, a_0] \longmapsto [0, a_{N-1}, \ldots, a_1, a_0].$$

That is, we simply pad $a$ by setting the ghost bit equal to zero. Moving in the
other direction is almost as easy. If the ghost bit is zero, we drop it, while if the
ghost bit is one, we first take the complement:

$$R_p \longrightarrow GF(2^N), \qquad [a_N, \ldots, a_1, a_0] \longmapsto
\begin{cases}
[a_{N-1}, \ldots, a_1, a_0] & \text{if } a_N = 0, \\
\sim[a_{N-1}, \ldots, a_1, a_0] & \text{if } a_N = 1.
\end{cases}$$

Here $\sim$ means take the complement, that is, flip every bit. If this isn't available
as a primitive operation, one can XOR with $1111 \cdots 111$.
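As a small illustration of the two maps, the following Python sketch stores a bit list as an integer whose bit $i$ is the coefficient $a_i$. This storage convention and the names `to_rp` and `from_rp` are assumptions of the sketch, not something fixed by the text.

```python
def to_rp(a, N):
    """Embed an element of GF(2^N) into Rp: the ghost bit (bit N) is set to 0."""
    if not 0 <= a < (1 << N):
        raise ValueError("a must fit in N bits")
    return a


def from_rp(a, N):
    """Map an (N+1)-bit element of Rp back to GF(2^N)."""
    mask = (1 << N) - 1            # N low-order ones
    if (a >> N) & 1:
        # Ghost bit is one: complement the low N bits, then drop the ghost bit.
        return (a ^ mask) & mask
    # Ghost bit is zero: simply drop it.
    return a & mask


N = 4
x = 0b10110                        # ghost bit set, low bits 0110
print(bin(from_rp(x, N)))
```

Complementing is the same as adding $\Phi(X)$, so an element of $R_p$ and its complement map to the same field element.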



2.2 Addition in Rp 

Addition in $R_p$ is the usual addition of vectors over $GF(2)$. That is, the
coordinates are added using the rules $0 + 0 = 1 + 1 = 0$ and $0 + 1 = 1 + 0 = 1$.

2.3 Squaring in Rp 

The squaring operation in $R_p$ is very fast. One simply interleaves the top order
bits and the bottom order bits. Thus if $a = [a_N, a_{N-1}, \ldots, a_0] \in R_p$, then

$$a^2 = [a_{N/2},\ a_N,\ a_{N/2-1},\ a_{N-1},\ \ldots,\ a_{N/2+2},\ a_1,\ a_{N/2+1},\ a_0].$$

Fast squaring can be implemented using the operation that takes a $w$-bit word
$[b_w, b_{w-1}, \ldots, b_1]$ and returns the two words $[b_w, 0, b_{w-1}, 0, \ldots, 0, b_{w/2+1}, 0]$ and
$[b_{w/2}, 0, \ldots, b_2, 0, b_1, 0]$. This is trivial to implement in hardware, while in
software it might be quickest to implement at the word level using a look-up table.
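The interleaving can be seen as moving coefficient $a_i$ to position $2i \bmod p$. The Python sketch below illustrates this at the bit level (not the word-level table method mentioned above) and cross-checks it against a plain cyclic convolution. Integers hold the coefficients with bit $i$ equal to $a_i$, and the function names are our own.

```python
def square_rp(a, p):
    """Square a in GF(2)[X]/(X^p - 1): the coefficient of X^i moves to X^(2i mod p)."""
    c = 0
    for i in range(p):
        if (a >> i) & 1:
            # p is odd, so i -> 2i mod p is a permutation and no two bits collide.
            c |= 1 << ((2 * i) % p)
    return c


def mul_rp(a, b, p):
    """Reference cyclic convolution over GF(2), used only as a cross-check."""
    c = 0
    for i in range(p):
        if (a >> i) & 1:
            for j in range(p):
                if (b >> j) & 1:
                    c ^= 1 << ((i + j) % p)
    return c


p = 5
a = 0b00110                        # X^2 + X
print(bin(square_rp(a, p)), bin(mul_rp(a, a, p)))
```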
2.4 Multiplication in Rp

Multiplication in $R_p$ is extremely fast, because the transition matrix for
multiplication has complexity $p$ (i.e., it is a $p$-by-$p$ matrix with $p$ entries equal to 1
and the rest 0). Multiplication in $R_p$ is simply the convolution product of the
coefficient vectors:

$$a(X)\,b(X) = \sum_{k=0}^{p-1} \Bigl( \sum_{i+j=k} a_i b_j \Bigr) X^k ,$$

where we understand that the indices on $a$ and $b$ are taken modulo $p$. We will
discuss in more detail below various ways in which to optimize the multiplication
process.

Remark 3. Since multiplication $c = ab$ is simply the convolution product of the
vectors $a$ and $b$, it would be nice to use Fast Fourier Transforms to compute
these convolutions. Unfortunately the vectors have dimension $p$, which is prime,
so FFT does not help. On the other hand, some speed-up may be possible using
the standard trick (Karatsuba multiplication) of splitting polynomials in half
and replacing multiplications by additions, see for example [2, §3.1.2].

2.5 Inversion in Rp 

Inversion in $R_p$ and in $GF(2^N)$ is extremely fast. Not all elements of $R_p$ are
invertible, but we are really interested in computing inverses in $GF(2^N)$. (Aside:
$a \in R_p$ is invertible if and only if $a \ne \Phi$ and $a$ has an odd number of 1 bits.)
An especially efficient way to compute these inverses is the “Almost Inverse
Algorithm” described in [23, §4.4]. Given a polynomial $a(X) \in GF(2)[X]$, the
almost inverse algorithm efficiently finds a polynomial $A(X)$ so that

$$a(X)A(X) \equiv X^k \pmod{\Phi(X)}$$

for some exponent $0 \le k < 2N$. Then $a(X)^{-1} = X^{-k}A(X)$, where the product
$X^{-k}A(X)$ is easily computed as a cyclic right shift in $R_p$. Compare with [23],
where the final step of dividing by $X^k$ requires more work. (Use of the almost
inverse algorithm is also efficient for computing inverses using ONB's, especially
of Type I, see [21, §11.1].)

We also mention the well-known alternative method of inversion via
multiplication using the relation

$$a(X)^{2^N - 1} \equiv 1 \pmod{X^p - 1} \quad \text{for all invertible } a(X) \in R_p .$$

Thus we can compute the inverse of an invertible $a(X)$ by repeated squaring
and multiplication:

$$a(X)^{-1} \equiv a(X)^{2^N - 2} \pmod{X^p - 1}.$$
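To illustrate inversion by exponentiation, the following Python sketch raises an invertible element of $R_p$ to the power $2^N - 2$ by square-and-multiply. The helper names are hypothetical, coefficients are stored with bit $i$ equal to $a_i$, and the small example $p = 11$ (for which 2 is a primitive root) is our choice.

```python
def mul_rp(a, b, p):
    """Cyclic convolution of coefficient vectors over GF(2), i.e. a*b in Rp."""
    c = 0
    for i in range(p):
        if (a >> i) & 1:
            for j in range(p):
                if (b >> j) & 1:
                    c ^= 1 << ((i + j) % p)
    return c


def pow_rp(a, e, p):
    """Square-and-multiply exponentiation in Rp."""
    result = 1
    while e:
        if e & 1:
            result = mul_rp(result, a, p)
        a = mul_rp(a, a, p)
        e >>= 1
    return result


def inverse_rp(a, p):
    """Inverse of an invertible a in Rp, computed as a^(2^N - 2) with N = p - 1."""
    N = p - 1
    return pow_rp(a, (1 << N) - 2, p)


p = 11
a = 0b00000000111                  # X^2 + X + 1: odd number of 1 bits, not Phi
print(mul_rp(a, inverse_rp(a, p), p))
```

In practice the squarings would use the interleaving of Section 2.3 rather than a general multiplication; the sketch uses `mul_rp` throughout only to stay short.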
2.6 Quadratic Equations in Rp

For elliptic curve applications, it is important to be able to solve the equation
$z^2 + z + c = 0$ in $GF(2^N)$, see [21, §6.5]. Not all such equations are solvable,
the necessary and sufficient condition being $\mathrm{Tr}(c) = 0$, where Tr is the trace
map $GF(2^N) \to GF(2)$. The analogous condition for $R_p$ says that $z^2 + z + c = 0$
has a solution (actually 4 solutions) in $R_p$ if and only if $c_0 + c_1 + \cdots + c_N = 0$ and
$c_0 = 0$. If there is a solution, then a solution may be computed using a recursion
coming from the formula

$$z_{2i} = z_i + c_{2i} \quad (\text{indices modulo } p),$$

which is obtained by comparing the coefficients of $X^{2i}$ in $z^2 + z + c = 0$.
The recursion is very simple because the squaring and square root operations
in $R_p$ are so simple. We also note that if $c_0 + c_1 + \cdots + c_N = c_0 = 1$, then there
will still be a solution to $z^2 + z + c = 0$ in $GF(2^N)$. This solution may be found
by first replacing $c$ with its complement $\sim c$, next solving $z^2 + z + {\sim}c = 0$ in $R_p$,
and finally mapping the result back to $GF(2^N)$ in the usual way.

Remark 4. It is possible to use an (automatically “optimal”) normal basis in the
ring $R_p$. To do this, write each element of $R_p$ in terms of the basis

$$X,\ X^{2},\ X^{2^2},\ \ldots,\ X^{2^{N-1}},$$

where it is understood that the exponents are reduced modulo $p$. All of the usual
comments that apply to normal bases in $GF(2^N)$ apply to using a normal basis
in the ring $R_p$. (See [15], [19], or [21, chapter 4] for information about optimal
normal bases.) In particular, if a normal basis is used in $R_p$, then the complexity
has the usual “optimal” value of $2p - 1$, and it is necessary to use a log table
and an anti-log table to sort out the exponents when doing multiplications.
Thus using a normal basis in $R_p$ leads to slower multiplications than using the
standard polynomial basis. On the other hand, squaring using a normal basis is
simply a shift of bits, while squaring with the polynomial basis is interspersion
of bits, so it is conceivable that situations or architectures might exist for which
the normal basis is preferable.



Remark 5. For Diffie-Hellman key exchange, ElGamal encryption, and similar
applications, there is no reason to move back and forth between $GF(2^N)$ and $R_p$.
One could do all the work in $R_p$ and move back to $GF(2^N)$ at the end. (Even
in $R_p$, only one bit is exposed, namely evaluation at $X = 1$. Thus for the discrete
logarithm problem $a(X)^k = b(X)$ in $R_p$, an attacker only deduces either $0^k = 0$
or $1^k = 1$ in $GF(2)$, so he gains no information about the exponent $k$.)

A similar comment applies when working with elliptic curves, keeping in mind
that not all elements of $R_p$ have inverses. Thus when computing the reciprocal
of $a(X) \in R_p$, if $a$ has an even number of 1 bits, then $a$ must first be replaced
by its complement.

Remark 6. For certain finite fields, essentially Type II ONB fields [8] and their
generalizations [5,6], it is possible to construct special elements called Gauss
periods whose powers can be computed extremely rapidly. An interesting feature
of these constructions is that the exponentiation process makes use of a
“redundant representation” (see [6, page 345]), which is analogous to our “ghost bit”.
However, the bases used in [5,6,8] are normal bases and the fast exponentiation
operation only applies to special elements, while the GBB construction in this
paper gives a fast multiplication for arbitrary elements. Thus the two
constructions are fundamentally different, as is also apparent from the fact that the fields
to which they apply are different.

3 Bit-Level Description of Operations in GF(2^N) and Rp

In this section we give bit-level descriptions of the basic operations described in
Section 2. It is relatively straightforward to give analogous word-level
descriptions, although for full efficiency it is important to use all of the usual programming
tricks.

Remark 7. The algorithms in this section take as input an element of $GF(2^N)$
and return an element of $GF(2^N)$ using the standard basis for $GF(2)[X]/\Phi(X)$.
As noted above, in practice one could do all computations in $R_p$ and only move
the answer back to $GF(2^N)$ as the very last step.

Remark 8. Polynomials in the following algorithms are written in the form

a[N]X^N + a[N-1]X^(N-1) + ... + a[2]X^2 + a[1]X + a[0].

Thus a[i] refers to the coefficient of $X^i$. We stress this point because when
implementing these algorithms, it can be confusing (at least to the author) if
the vector of coefficients is stored from high-to-low, instead of from low-to-high.
We also note that $a(X) + \Phi(X)$ is the complement $\sim a(X)$ of $a(X)$. This is
correct because $\Phi(X) = X^N + \cdots + X + 1 = [1, 1, \ldots, 1]$ has all of its bits set
equal to 1. We also remind the reader again that $p = N + 1$.

Addition is simply addition of vectors with coordinates in GF(2), so there is 
nothing further to say. 

The squaring operation is an interleaving permutation of the coefficients. 

Bit-Level Procedure for Squaring in GF(2^N)

Input:  a(X)
Output: c(X) = a(X)^2 mod Phi(X)
Step 1: b(X) := a[0] + a[1]X^2 + a[2]X^4 + ... + a[N/2]X^N
Step 2: c(X) := a[N/2+1]X + a[N/2+2]X^3 + ... + a[N]X^(N-1)
Step 3: c(X) := b(X) + c(X)
Step 4: if c[N]=1 then c(X) := c(X) + Phi(X)

The multiplication operation in $R_p$ is what is commonly known as a
convolution product. Here is how multiplication works at the bit level.

Bit-Level Procedure for Multiplication in GF(2^N)

Input:  a(X), b(X)
Output: c(X) = a(X)b(X) mod Phi(X)
Step 1: c(X) := 0
Step 2: for i=0 to p-1 do
Step 3:     if a[i]=1 then c(X) := c(X) + b(X)
Step 4:     cyclic shift c(X) right 1 bit
Step 5: if c[p-1]=1 then c(X) := c(X) + Phi(X)

The most time-consuming part of multiplication is Step 3, since this step is
inside the main loop and requires $p$ additions (i.e., each of the $p$ coefficients of $b$
must be added or XOR'd to the corresponding coefficient of $c$). For comparison
purposes, the analogous routine using ONB has the equivalent of two Step 3's,
so it takes approximately twice as long (or alternatively requires twice as
complicated a circuit). Similarly, Montgomery multiplication over $GF(2^N)$ has the
equivalent of two Step 3's, so also takes twice as long (cf. [13]).
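The five steps translate almost line for line into code. The Python sketch below is an illustration, not a tuned implementation. It stores coefficients in an integer with bit $i$ equal to a[i], and it includes a schoolbook multiplication modulo $\Phi(X)$ (our own helper `mul_mod_phi`) purely as a cross-check.

```python
def multiply_gf2n(a, b, N):
    """Bit-level multiplication in GF(2^N) via Rp, following Steps 1-5."""
    p = N + 1
    phi = (1 << p) - 1                           # Phi(X): all p bits set
    c = 0                                        # Step 1
    for i in range(p):                           # Step 2
        if (a >> i) & 1:                         # Step 3
            c ^= b
        c = (c >> 1) | ((c & 1) << (p - 1))      # Step 4: cyclic right shift
    if (c >> (p - 1)) & 1:                       # Step 5: clear the ghost bit
        c ^= phi
    return c


def mul_mod_phi(a, b, N):
    """Schoolbook product over GF(2), reduced modulo Phi(X); cross-check only."""
    phi = (1 << (N + 1)) - 1
    prod = 0
    for i in range(N):
        if (a >> i) & 1:
            prod ^= b << i
    for d in range(prod.bit_length() - 1, N - 1, -1):
        if (prod >> d) & 1:
            prod ^= phi << (d - N)               # cancel the term of degree d
    return prod


N = 4                                            # p = 5, and 2 is a primitive root mod 5
print(bin(multiply_gf2n(0b0010, 0b1000, N)), bin(mul_mod_phi(0b0010, 0b1000, N)))
```

The loop shifts right once per iteration, so the contribution added at step $i$ is shifted $p - i$ times in total, which in $R_p$ is the same as multiplying by $X^i$.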

Computation of inverses is relatively straightforward. We give below a slight
adaptation of Schroeppel, Orman, O'Malley, and Spatscheck's “Almost Inverse
Algorithm” [23] (with an improvement suggested by Schroeppel) that works
quite well. The speed of the Inversion Procedure can be significantly enhanced
by a number of implementation tricks, such as expanding the operations on
$b, c, f, g$ into inline loop-unrolled code. We refer the reader to [23] for a list of
practical suggestions.

Bit-Level Inversion Procedure in GF(2^N)

Input:   a(X)
Output:  b(X) = a(X)^(-1) mod Phi(X)
Step 1:  Initialization:
             k:=0; b(X):=1; c(X):=0;
             f(X):=a(X); g(X):=Phi(X);
Step 2:  do while f[0]=0
Step 3:      f(X):=f(X)/X; k:=k+1;
Step 4:  do while f(X)!=1
Step 5:      if deg(f) < deg(g) then
Step 6:          exchange f and g; exchange b and c;
Step 7:      f(X):=f(X)+g(X)
Step 8:      b(X):=b(X)+c(X)
Step 9:      do while f[0]=0
Step 10:         f(X):=f(X)/X; c(X):=c(X)*X; k:=k+1;
Step 11: b(X):=b(X)/X^k modulo X^p-1
Step 12: if b[p-1]=1 then b(X):=b(X)+Phi(X)

Note that in Steps 3 and 10 of the inversion routine, $f(X)/X$ is $f$ shifted
right one bit and $c(X) \cdot X$ is $c$ shifted left one bit. Further, Step 11 in the
Inversion Procedure is simply the cyclic shift

$$[b_{p-1}, b_{p-2}, \ldots, b_1, b_0] \longmapsto [b_{k-1}, b_{k-2}, \ldots, b_0, b_{p-1}, \ldots, b_{k+1}, b_k].$$

It is instructive to compare the simplicity of this step with the description [23,
page 51] of how to compute $X^{-k}b(X)$ when $X^p - 1$ is replaced by a trinomial,
even if the trinomial is selected to make this operation as simple as possible.
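A direct transcription of the Inversion Procedure into Python might look as follows. This is a sketch under our own conventions: coefficients are stored with bit $i$ equal to the coefficient of $X^i$, the input must be a nonzero $N$-bit value, $\Phi(X)$ is assumed irreducible, and the explicit folding of b into $p$ bits before Step 11 is a safety measure we added because Python integers grow without bound. The helper `mul_mod_phi` is only a cross-check.

```python
def inverse_gf2n(a, N):
    """Almost Inverse Algorithm in GF(2^N) via Rp, following Steps 1-12."""
    if not 0 < a < (1 << N):
        raise ValueError("a must be a nonzero N-bit value")
    p = N + 1
    phi = (1 << p) - 1                           # Phi(X) = X^N + ... + X + 1
    k, b, c, f, g = 0, 1, 0, a, phi              # Step 1
    while (f & 1) == 0:                          # Steps 2-3
        f >>= 1
        k += 1
    while f != 1:                                # Step 4
        if f.bit_length() < g.bit_length():      # Steps 5-6: deg(f) < deg(g)
            f, g = g, f
            b, c = c, b
        f ^= g                                   # Step 7
        b ^= c                                   # Step 8
        while (f & 1) == 0:                      # Steps 9-10
            f >>= 1
            c <<= 1
            k += 1
    mask = (1 << p) - 1
    while b >> p:                                # fold into p bits, using X^p = 1
        b = (b & mask) ^ (b >> p)
    r = k % p                                    # Step 11: cyclic right shift by k
    b = ((b >> r) | (b << (p - r))) & mask
    if (b >> (p - 1)) & 1:                       # Step 12
        b ^= phi
    return b


def mul_mod_phi(a, b, N):
    """Schoolbook product over GF(2), reduced modulo Phi(X); cross-check only."""
    phi = (1 << (N + 1)) - 1
    prod = 0
    for i in range(N):
        if (a >> i) & 1:
            prod ^= b << i
    for d in range(prod.bit_length() - 1, N - 1, -1):
        if (prod >> d) & 1:
            prod ^= phi << (d - N)
    return prod


N = 10                                           # p = 11
a = 0b1011
print(mul_mod_phi(a, inverse_gf2n(a, N), N))
```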

Finally, we describe how to solve a quadratic equation $z^2 + z + c = 0$
in $GF(2^N)$. The equivalent formula $z^2 = z + c$ shows that the solution
$z = [z_N, \ldots, z_1, z_0]$ may be obtained recursively using the relation

$$z_{2i} = z_i + c_{2i}, \qquad 0 \le i \le N,$$

where it is understood that the indices are always reduced modulo $p$ into the
range $[0, N]$. In particular, putting $i = 0$ shows that a necessary condition for a
solution to exist in $R_p$ is $c_0 = 0$.

Bit-Level Quadratic Formula in GF(2^N)

Input:   c(X)
Output:  z(X) satisfying z(X)^2+z(X)+c(X)=0 mod Phi(X)
Step 1:  if c[0]=1 then c(X):=c(X)+Phi(X)
Step 2:  z[0]:=0; z[1]:=1; j:=1;
Step 3:  do N-1 times
Step 4:      i:=j
Step 5:      j:=2*j mod p
Step 6:      z[j]:=z[i]+c[j]
Step 7:  if z[N/2+1]=z[1]+c[1] then
Step 8:      if z[N]=1 then z(X):=z(X)+Phi(X)
Step 9:      return z(X)
Step 10: else
Step 11:     return "Error: z^2+z+c=0 not solvable"
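The quadratic procedure can be sketched in Python as follows. The bit storage (bit $i$ holds the coefficient of $X^i$), the use of `None` for the unsolvable case, and the helper `mul_mod_phi` used for checking are choices of this sketch rather than part of the procedure itself.

```python
def solve_quadratic(c, N):
    """Find z with z^2 + z + c = 0 mod Phi(X), or return None if unsolvable."""
    p = N + 1
    phi = (1 << p) - 1
    if c & 1:                                    # Step 1
        c ^= phi
    z = [0] * p                                  # Step 2: z[0] = 0
    z[1] = 1
    j = 1
    for _ in range(N - 1):                       # Steps 3-6
        i = j
        j = (2 * j) % p
        z[j] = z[i] ^ ((c >> j) & 1)
    if z[N // 2 + 1] != z[1] ^ ((c >> 1) & 1):   # Step 7: consistency check
        return None                              # Steps 10-11
    value = sum(bit << idx for idx, bit in enumerate(z))
    if z[N]:                                     # Step 8
        value ^= phi
    return value                                 # Step 9


def mul_mod_phi(a, b, N):
    """Schoolbook product over GF(2), reduced modulo Phi(X); cross-check only."""
    phi = (1 << (N + 1)) - 1
    prod = 0
    for i in range(N):
        if (a >> i) & 1:
            prod ^= b << i
    for d in range(prod.bit_length() - 1, N - 1, -1):
        if (prod >> d) & 1:
            prod ^= phi << (d - N)
    return prod


N = 10
solvable = [c for c in range(1 << N) if solve_quadratic(c, N) is not None]
print(len(solvable), "of", 1 << N, "values of c give a solvable equation")
```

Because 2 is a primitive root modulo $p$, the index $j$ visits every position $1, \ldots, N$ exactly once, so each coefficient of $z$ is assigned once before the final consistency check.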



4 Selection of Good Fields GF(2^N)

Our first requirement in choosing $GF(2^N)$ is that $p = N + 1$ is prime and 2 is a
primitive root modulo $p$, since this ensures that the cyclotomic polynomial

$$\Phi(X) = X^N + X^{N-1} + \cdots + X^2 + X + 1 = \frac{X^{N+1} - 1}{X - 1}$$

is irreducible in $GF(2)[X]$.

Table 1 lists all primes in the intervals [100,300], [650,850], and [1000,1200]
for which $\Phi(X)$ is irreducible in $GF(2)[X]$. It is clear that there are lots of primes
with this property. (Mathematical Aside: A conjecture of Emil Artin says that
there are infinitely many primes $p$ with this property. Artin's conjecture has not
been proven unconditionally, but Hooley [12] has shown that Artin's conjecture
would follow from the generalized Riemann hypothesis.)
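The selection criterion is easy to test by machine. The Python sketch below (function names are our own) lists the primes $p$ in an interval for which 2 is a primitive root modulo $p$, using trial division and a direct computation of the multiplicative order of 2. This is adequate for the small ranges considered here.

```python
def is_prime(n):
    """Trial-division primality test; fine for small n."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def order_of_two(p):
    """Multiplicative order of 2 modulo an odd prime p."""
    k, x = 1, 2 % p
    while x != 1:
        x = (2 * x) % p
        k += 1
    return k


def good_primes(lo, hi):
    """Odd primes p in [lo, hi] for which 2 is a primitive root modulo p."""
    return [p for p in range(max(lo, 3), hi + 1)
            if is_prime(p) and order_of_two(p) == p - 1]


print(good_primes(100, 300))
```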

If one is merely interested in working in a field $GF(2^N)$ having a very fast
multiplication method, then any of the primes in Table 1 will work (taking
$N = p - 1$). For example, this is the case for the many cryptographic applications
that use elliptic curves over finite fields. For elliptic curve cryptography, one
might take $N$ to be one of the values 162, 172, 178, 180, or 196.

Table 1. Some primes $p$ with $\Phi(X)$ irreducible in $GF(2)[X]$

101, 107, 131, 139, 149, 163, 173, 179, 181, 197, 211, 227, 269, 293
653, 659, 661, 677, 701, 709, 757, 773, 787, 797, 821, 827, 829
1019, 1061, 1091, 1109, 1117, 1123, 1171, 1187

On the other hand, if one wishes to use the discrete logarithm problem (DLP)
in $GF(2^N)$, for example with Diffie-Hellman key exchange or the ElGamal
public key cryptosystem, then there is another very important issue to consider.
The group of non-zero elements in the field $GF(2^N)$ is a cyclic group of order
$2^N - 1$, and if $2^N - 1$ factors as a product of small primes, the Pohlig-Hellman
algorithm [20] gives a reasonably efficient way to solve the DLP in $GF(2^N)$.

To investigate prime divisors of $2^N - 1$, we begin with the factorization of
$X^N - 1$ as a product of cyclotomic polynomials,

$$X^N - 1 = \prod_{d \mid N} \Phi_d(X).$$

Here $\Phi_d(X)$ is the $d$-th cyclotomic polynomial. That is, $\Phi_d(X)$ is the polynomial
whose roots are the primitive $d$-th roots of unity,

$$\Phi_d(X) = \prod_{1 \le k \le d,\ \gcd(k,d) = 1} \bigl( X - e^{2\pi i k/d} \bigr).$$

The polynomial $\Phi_d(X)$ has integer coefficients and is irreducible in $\mathbb{Q}[X]$. We will
not need to use any special properties of the $\Phi_d$'s, but for further information
on cyclotomic polynomials, see for example [25].

For cryptographic purposes, we want to choose a value for $N$ so that $2^N - 1$
is divisible by a large prime. We always have the factorization

$$2^N - 1 = \prod_{d \mid N} \Phi_d(2),$$

so we look for cyclotomic polynomial values $\Phi_d(2)$ that have large prime divisors.
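For small experiments the values $\Phi_d(2)$ can be computed directly from this factorization, since $\Phi_n(2)$ equals $2^n - 1$ divided by the product of $\Phi_d(2)$ over the proper divisors $d$ of $n$. The Python sketch below does exactly that; the function names are hypothetical and the naive divisor search is only meant for modest $n$.

```python
from functools import lru_cache


def divisors(n):
    """All positive divisors of n, by naive search."""
    return [d for d in range(1, n + 1) if n % d == 0]


@lru_cache(maxsize=None)
def cyclotomic_at_2(n):
    """Phi_n(2), obtained by dividing 2^n - 1 by Phi_d(2) for proper divisors d."""
    value = 2 ** n - 1
    for d in divisors(n):
        if d < n:
            value //= cyclotomic_at_2(d)         # each division is exact
    return value


N = 786
product = 1
for d in divisors(N):
    product *= cyclotomic_at_2(d)
print(product == 2 ** N - 1)
print(cyclotomic_at_2(393).bit_length(), "bits in Phi_393(2)")
```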

The problem of factoring numbers of the form $2^N - 1$ has a long history.
Indeed, the Cunningham Project set itself the long-term task of factoring numbers
of the form $b^n \pm 1$. Current results on the Cunningham Project are available
on the web at [4]. The following two examples were devised using material from
that site, but we include sufficient information here so that the interested reader
can check that our examples have the stated properties.

Example 1. For our first example we take $p = 787$ and $N = 786$. Since 786 is
divisible by 393, we see that $2^{786} - 1$ is divisible by

$$\Phi_{393}(2) = \frac{2^{393} - 1}{(2^{131} - 1)(2^{3} - 1)}.$$

The Cunningham Project archive says that $\Phi_{393}(2)$ factors into primes as

$$\Phi_{393}(2) = 36093121 \cdot 51118297 \cdot 58352641 \cdot q.$$

Here $q \approx 10^{55}$ is a prime. Hence $2^{786} - 1$ is divisible by the large prime $q$,
so $GF(2^{786})$ is a suitable field for use with Diffie-Hellman and other schemes that
depend on the intractability of the discrete logarithm problem.

Example 2. As a second example, consider $p = 1019$ and $N = 1018$. Since 1018
is divisible by 509, we see that $2^{1018} - 1$ is divisible by $\Phi_{509}(2) = 2^{509} - 1$. From
the Cunningham Project archive, we find that $2^{509} - 1$ factors into primes as

$$2^{509} - 1 = 12619129 \cdot 19089479845124902223 \cdot 647125715643884876759057 \cdot q.$$

Here $q \approx 10^{103.03}$ is prime. Hence $2^{1018} - 1$ is divisible by the large
prime $q$, so $GF(2^{1018})$ is a suitable field for use with Diffie-Hellman and other
schemes that depend on the intractability of the discrete logarithm problem.

Remark 9. There are many other $p$'s listed in Table 1 with the property that
$2^{p-1} - 1$ is divisible by a large prime. We have merely presented two examples
for which $GF(2^N)$ has approximately the same number of elements as the “First
and Second Oakley Groups” described in [10]. However, we note that the discrete
logarithm problem in $GF(2^N)$ may be easier to solve than in $GF(p)$ for $p \approx 2^N$,
see for example [3,9,16,17,22].

construction described here should be credited to Ito and Tsujii.

Abstract. Conversion of finite field elements from one basis
representation to another representation in a storage-efficient manner is crucial
if these techniques are to be carried out in hardware for cryptographic
applications. We present algorithms for conversion to and from dual of
polynomial and dual of normal bases, as well as algorithms to convert to
a polynomial or normal basis which involve the dual of the basis. This
builds on work by Kaliski and Yin presented at SAC '98.



1 Introduction 

Conversion between different choices of basis for a finite field is an important
problem in today's computer systems, particularly for cryptographic
operations [1]. While it is possible to convert between two choices of basis by matrix
multiplication, the matrix may be too large for some applications, hence the
motivation for more storage-efficient techniques. The most likely such application
would be in special-purpose hardware devices, but there are others as well.

The paper of Kaliski and Yin [2] introduced the shift-extract technique
of basis conversion, and also gave several storage-efficient algorithms based on
those techniques for converting to a polynomial or normal basis. In this paper,
we introduce techniques involving the dual of a polynomial or normal basis,
including storage-efficient generation of a dual basis and storage-efficient shifting
in such a basis. The new techniques result in several new storage-efficient
algorithms for converting to and from the dual of a polynomial or normal basis, as
well as additional algorithms for converting to a polynomial or normal basis.

2 Background 

Elements of a finite field can be represented in a variety of ways, depending on
the choice of basis for the representation [3]. Let $GF(q^m)$ be the finite field, and
let $GF(q)$ be the ground field over which it is defined, where $q$ is a prime or a
prime power. We say that the characteristic of the field is $p$ where $q = p^r$ for
some $r \ge 1$. For even-characteristic fields, we have $p = 2$. The degree of the field
is $m$; its order is $q^m$.



C.K. Koc and C. Paar (Eds.): CHES'99, LNCS 1717, pp. 135-143, 1999.
(c) Springer-Verlag Berlin Heidelberg 1999

A basis for the finite field is a set of $m$ elements $\omega_0, \ldots, \omega_{m-1} \in GF(q^m)$
such that every element of the finite field can be represented uniquely as a linear
combination of basis elements. We write

$$\epsilon = B[0]\,\omega_0 + B[1]\,\omega_1 + \cdots + B[m-1]\,\omega_{m-1},$$

where $B[0], \ldots, B[m-1] \in GF(q)$ are the coefficients.

Two common types of basis are a polynomial basis and a normal basis. In
a polynomial basis, the basis elements are successive powers of an element $\gamma$,
called the generator:

$$\omega_i = \gamma^i .$$

In a normal basis, the basis elements are successive exponentiations of an element
$\gamma$, again called the generator:

$$\omega_i = \gamma^{q^i} .$$

Another common type of basis is a dual basis. Let $\omega_0, \ldots, \omega_{m-1}$ be a basis
and let $h$ be a nonzero linear function from $GF(q^m)$ to $GF(q)$, i.e., a function
such that for all $\epsilon, \phi \in GF(q^m)$ and $c \in GF(q)$, $h(\epsilon + \phi) = h(\epsilon) + h(\phi)$ and
$h(c\epsilon) = c\,h(\epsilon)$. The dual basis of the basis $\omega_0, \ldots, \omega_{m-1}$ with respect to the
function $h$ is the basis $\eta_0, \ldots, \eta_{m-1}$ such that for $0 \le i, j \le m - 1$,

$$h(\omega_i \eta_j) = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$

Duality is symmetric: the dual basis with respect to $h$ of the basis $\eta_0, \ldots, \eta_{m-1}$
is the basis $\omega_0, \ldots, \omega_{m-1}$. A dual basis can be defined for a polynomial basis,
a normal basis, or any other choice of basis, and with respect to a variety of
functions.

The basis conversion or change-of-basis problem is to compute the
representation of an element of a finite field in one basis, given its representation in
another basis. The problem has two forms, where we distinguish between the
internal basis in which finite field operations are performed, and the external
basis to and from which we are converting:

- Import problem. Given an internal basis and an external basis for a finite
  field $GF(q^m)$ and the representation $B$ of a field element in the external basis
  (the external representation), determine the corresponding representation $A$
  of the same field element in the internal basis (the internal representation).

- Export problem. Given an internal basis and an external basis for a finite
  field $GF(q^m)$ and the internal representation $A$ of a field element, determine
  the corresponding external representation $B$ of the same field element.

Normally, the import and export problem could be solved by using a change
of basis matrix, which requires storage for $O(m)$ field elements. Since each field
element consists of $m$ base field coefficients, this is $O(m^2)$ coefficients. In
constrained environments, this may be too large. What we want are algorithms which
require storage for $O(1)$ field elements or $O(m)$ coefficients. The algorithms given
in this paper for dual bases satisfy this requirement.




Efficient Finite Field Basis Conversion Involving Dual Bases



3 Overview of Techniques 

In the following, the dual of a polynomial basis is called a polynomial* basis, 
and the dual of a normal basis is called a normal* basis. 

3.1 Import Algorithms 

Given an internal basis and an external basis for a finite field and the
representation $B$ of a field element in the external basis, an import algorithm determines
the corresponding representation $A$ of the same field element in the internal
basis.

Two general methods for determining the internal representation $A$ are
described: the generate-accumulate method and the shift-insert method.

Generate-Accumulate method. The generate-accumulate method computes
the internal representation $A$ by accumulating the products of coefficients $B[i]$
with successive elements of the external basis. The basic form of the algorithm
for this method is as follows:

proc ImportByGenAccum
    A ← 0
    for i from 0 to m - 1 do
        A ← A + B[i] × W_i
    endfor
endproc

As written, this algorithm requires storage for the $m$ values $W_0, \ldots, W_{m-1}$, which
are the internal representations of the elements of the external basis. To reduce
the storage requirement, it is necessary to generate the values as part of the
algorithm. This is straightforward when the external basis is a polynomial basis
or a normal basis. For polynomial* and normal* bases, algorithms are given in
this paper.
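For the straightforward case of a polynomial external basis, the values $W_i$ can be generated on the fly as successive powers of the generator. The Python sketch below illustrates this over $GF(2^m)$. It assumes that the internal basis is a polynomial basis modulo an irreducible polynomial `modulus` (stored as an integer with bit $i$ the coefficient of $x^i$), and the function names are our own.

```python
def gf_mul(a, b, modulus, m):
    """Multiply a and b in GF(2^m), internal polynomial basis modulo `modulus`."""
    r = 0
    for i in range(m):
        if (b >> i) & 1:
            r ^= a
        a <<= 1                                  # a <- a * x
        if (a >> m) & 1:
            a ^= modulus                         # reduce back below degree m
    return r


def import_by_gen_accum(B, G, modulus, m):
    """ImportByGenAccum for an external polynomial basis 1, gamma, ..., gamma^(m-1).

    B is the list of external coefficients in GF(2) and G is the internal
    representation of gamma. Only one basis element W is stored at a time.
    """
    A = 0
    W = 1                                        # W_0 = gamma^0
    for i in range(m):
        if B[i]:
            A ^= W                               # A <- A + B[i] x W_i over GF(2)
        W = gf_mul(W, G, modulus, m)             # generate W_(i+1) = W_i x G
    return A


m, modulus = 4, 0b10011                          # x^4 + x + 1 (assumed example field)
G = 0b0011                                       # gamma = x + 1
print(bin(import_by_gen_accum([0, 0, 1, 0], G, modulus, m)))
```

The storage used is a constant number of field elements (A, W and G), which is the point of generating the basis inside the loop.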

Shift-Insert method. The shift-insert method computes the internal
representation $A$ by “shifting” an intermediate variable in the external basis and inserting
successive coefficients between the shifts. This follows the same concept as the
shift-extract method below. Let Shift be a function that shifts an element in the
external basis, i.e., a function such as one which given the internal representation
of an element with external representation

$$(B[0], B[1], \ldots, B[m-2], B[m-1])$$

computes the internal representation of the element with external representation

$$(B[m-1], B[0], \ldots, B[m-3], B[m-2]).$$

(Other forms of shifting are possible, including shifting in the reverse direction,
or shifting where the value 0 rather than $B[m-1]$ is shifted in.)

The basic form of algorithm for this method is as follows ([2], Sec. 3.2, 3.3):

proc ImportByShiftInsert
    A ← 0
    for i from m - 1 downto 0 do
        Shift(A)
        A ← A + B[i] × W_0
    endfor
endproc

The direction of the for loop may vary depending on the direction of the shift.
One advantage of the shift-insert method over the generate-accumulate method
is that with a minor increase in storage, this algorithm can be parallelized. That
is, if $W_0$ and $W_{m/2}$ are available, two elements can be inserted per shift. Since
the shift is the most work-intensive part of the algorithm, this aids efficiency.
This improvement is further discussed in [2].

3.2 Export Algorithms 

Given an internal basis and an external basis for a finite field and the
representation $A$ of a field element in the internal basis, an export algorithm determines
the corresponding representation $B$ of the same field element in the external
basis.

Two general methods for determining the external representation $B$ are
described: the generate*-evaluate method and the shift-extract method.



Generate*-Evaluate method. The generate*-evaluate method computes the
external representation $B$ by evaluating products of $A$ with successive elements
of a dual of the external basis. For example, the following equation gives the $i$th
coefficient of the external representation:

$$B[i] = h(A X_i),$$

where $h$ is a linear function and $X_0, \ldots, X_{m-1}$ are the internal-basis
representations of the elements of the dual of the external basis with respect to the function
$h$. The basic form of algorithm for this method is as follows:

proc ExportByGen*Eval
    for i from 0 to m - 1 do
        T ← A × X_i
        B[i] ← h(T)
    endfor
endproc

This algorithm requires storage for the $m$ values $X_0, \ldots, X_{m-1}$, which are the
internal representations of the dual of the external basis. As was the case for
ImportByGenAccum, to reduce the storage requirement, it is necessary to
generate the values as part of the algorithm.

Shift-Extract method. The shift-extract method computes the external
representation $B$ by shifting an intermediate variable in the external basis and
extracting successive coefficients between the shifts. This follows the same concept
as the shift-insert method above, with a similar Shift function and an Extract
function that obtains a selected coefficient of the external representation. (The
Extract function is similar to the $h$ function in the previous method.)

The basic form of algorithm for this method is as follows ([2], Sec. 3.4, 3.5):

proc ExportByShiftExtract
    for i from m - 1 downto 0 do
        B[i] ← Extract(A)
        Shift(A)
    endfor
endproc

Again, the direction of the for loop may vary depending on the direction of the
shift. As with the shift-insert method above, the shift-extract method can be
parallelized to extract multiple coefficients per iteration.

3.3 Summary 

For these methods to accomplish our goal of being storage efficient, we depend 
on the efficiency of some additional functions. For the generate*-evaluate and 
generate-accumulate methods, we need an efficient dual basis generator. For the 
shift-insert and shift-extract methods, we need an efficient Shift function that 
works when the external basis is a normal* or polynomial* basis. An efficient 
Extract function (and hence an efficient method of evaluating a linear function 
h) is given in [2] (cf. Lemma 3). 

4 Polynomial* Basis Techniques 

This section discusses the structure of a polynomial* basis, and presents an 
efficient basis generation function and an efficient external shift function. 

Theorem 1. Let $1, \gamma, \ldots, \gamma^{m-1}$ be a polynomial basis for $GF(q^m)$, and let $h(\epsilon)$
be a linear function from $GF(q^m)$ to $GF(q)$. Let $h_0(\epsilon)$ be the 1-coefficient of the
representation of the element $\epsilon$ in the polynomial basis. Let $\zeta$ be the element of
$GF(q^m)$ such that $h_0(\zeta\epsilon) = h(\epsilon)$. A formula for the dual basis $\eta_0, \ldots, \eta_{m-1}$ of
this polynomial basis with respect to $h$ is

$$\eta_i = \zeta^{-1} \xi_i ,$$

where $\xi_0 = 1$ and $\xi_i = \gamma^{-1}\xi_{i-1} - h_0(\gamma^{-1}\xi_{i-1})$.

Proof. We first observe that the value $\zeta$ exists since there is a one-to-one
correspondence between linear functions and field elements (cf. Lemma 3 of [2]). To
prove the correctness of the formula, we use the definition of the dual basis and
induction. First, we consider $\eta_0$ and observe that
$h(\gamma^i \eta_0) = h_0(\zeta \gamma^i \zeta^{-1}) = h_0(\gamma^i)$, which is 1 if
$i = 0$ and 0 if $1 \le i \le m - 1$, meeting the definition. Now suppose we know that
for $j > 0$ the first $j - 1$ elements are correct elements of the dual basis. Then we
get the following for the $j$th element:

$$h(\gamma^i \eta_j) = h_0(\gamma^i \xi_j) = h_0(\gamma^{i-1}\xi_{j-1}) - h_0(\gamma^i)\,h_0(\gamma^{-1}\xi_{j-1}).$$

For $i = 0$, this reduces to $h_0(\gamma^{-1}\xi_{j-1}) - h_0(\gamma^{-1}\xi_{j-1}) = 0$. For $1 \le i \le m - 1$, the
equation becomes $h_0(\gamma^{i-1}\xi_{j-1})$, which by induction is 1 if $i = j$ and 0 if $i \ne j$.
In both cases the definition is met.

In the following algorithms, $Z$ will be the internal representation of $\zeta$, $G$ will
be the internal representation of $\gamma$, and $I$ will be the internal representation
of the identity element. The value $V_0$ corresponds to the element such that
$(A \times V_0)[0] = h_0(A)$. The value $Z$ corresponds to the function $h(\epsilon) = h_0(\zeta\epsilon)$,
and contains the information specific to the choice of dual basis in the following
algorithms. Note that if $Z$ is 0, $h(\epsilon) = h_0(0) = 0$, and therefore, $h$ would not be
a nonzero linear function. Thus, we can assume $Z$ is nonzero.

4.1 GenPoly* 

The algorithm GenPoly* generates the internal representation of the dual basis 
elements. GenPoly* is an iterator; it is meant to be called many times in
succession. The first time an iterator is called, it starts from the iter line. When 
a yield statement is reached, the iterator returns the value specified by the yield. 
The next time the iterator is called, it starts immediately after the last yield 
executed; all temporary variables are assumed to retain their values from one 
call to the next. An iterator ends when the enditer line is reached. 

iter GenPoly*
    W ← I
    yield Z^{-1}
    for i from 1 to m - 1 do
        W ← W × G^{-1}
        T ← W × V_0
        W ← W - T[0] × I
        yield W × Z^{-1}
    endfor
enditer
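
As a rough illustration of GenPoly*, the following Python sketch runs the iterator in the small field GF(2^4). It assumes, purely for convenience, that the internal basis is the polynomial basis 1, γ, γ², γ³ itself (γ a root of x⁴ + x + 1), so that T[0] is bit 0 of the internal representation and V_0 is the identity. The helper names (gf_mul, gf_pow, gf_inv) and the choice of ζ are hypothetical and not taken from the text.

```python
M = 4                 # extension degree m (assumed small example)
MODULUS = 0b10011     # x^4 + x + 1, irreducible over GF(2)

def gf_mul(a, b):
    """Multiply two elements of GF(2^4) held as bit vectors."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= MODULUS
    return result

def gf_pow(a, e):
    """Raise a to the e-th power by repeated multiplication."""
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result

def gf_inv(a):
    """Inverse of a nonzero element: a^(2^m - 2)."""
    return gf_pow(a, 2**M - 2)

G = 0b0010            # internal representation of gamma
I = 0b0001            # identity element
V0 = I                # assumption: (A x V0)[0] is simply bit 0 of A
Z = 0b0110            # arbitrary nonzero zeta (illustrative choice)

def gen_poly_star():
    """Yield the internal representations of eta_0, ..., eta_{m-1}."""
    G_inv, Z_inv = gf_inv(G), gf_inv(Z)
    W = I
    yield Z_inv
    for _ in range(1, M):
        W = gf_mul(W, G_inv)
        T = gf_mul(W, V0)
        W ^= (T & 1) * I          # W <- W - T[0] x I (subtraction is XOR here)
        yield gf_mul(W, Z_inv)

def h(x):
    """h(eps) = h0(zeta * eps), where h0 reads the 1-coefficient (bit 0)."""
    return gf_mul(Z, x) & 1

dual = list(gen_poly_star())
```

Under these assumptions, the list dual should satisfy the defining property of a dual basis: h(γ^i η_j) is 1 when i = j and 0 otherwise.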

4.2 ShiftPoly*

With our knowledge of the formula for a polynomial* basis, we can also devise 
a method for shifting an element’s representation in the polynomial* basis. The 
algorithm simply uses the recursive formula for generating $\xi_i$ from $\xi_{i-1}$, namely

$$s(\epsilon) = \gamma^{-1}\epsilon - h_0(\gamma^{-1}\epsilon).$$

Theorem 2. s performs an external shift in the polynomial* basis with respect
to $h_0$, such that $s(\xi_i) = \xi_{i+1}$ for $0 \le i \le m - 2$ and $s(\xi_{m-1}) = 0$.

Proof. First we observe that s is linear, i.e., that for all $\epsilon, \phi \in GF(q^m)$,
$c \in GF(q)$, $s(\epsilon + \phi) = s(\epsilon) + s(\phi)$ and $s(c\epsilon) = c\,s(\epsilon)$. Since s is linear,
we only have to show that applying it to basis elements is correct. Since s is merely
the recursive formula for generating $\xi_i$ from $\xi_{i-1}$, we know it is correct for all
basis elements except $\xi_{m-1}$. Thus, it remains to show that $s(\xi_{m-1}) = 0$. To see this,
define $\xi_m$ as $s(\xi_{m-1})$ and apply the equation from the proof above:

$$h_0(\gamma^i\xi_m) = h_0(\gamma^{i-1}\xi_{m-1}) - h_0(\gamma^i)\,h_0(\gamma^{-1}\xi_{m-1}).$$

For $i = 0$, this cancels to 0. For $1 \le i \le m - 1$, the equation becomes
$h_0(\gamma^{i-1}\xi_{m-1})$, which is 0. Since the values $h_0(\gamma^i\xi_m)$ as i varies correspond
to coefficients of the representation of $\xi_m$ in the basis $\xi_0, \ldots, \xi_{m-1}$ and they
are all zero, it follows that $\xi_m = 0$.

Since the dual basis $\eta_0, \ldots, \eta_{m-1}$ is just the basis $\xi_0, \ldots, \xi_{m-1}$ scaled by
$\zeta^{-1}$, shifting in the dual basis $\eta_0, \ldots, \eta_{m-1}$ is accomplished by computing the
function $\zeta^{-1}s(\zeta\epsilon)$. The following is an algorithm for shifting in the polynomial*
basis based on this technique.

proc ShiftPoly* (A)
    A ← A × ZG^{-1}
    T ← A × V_0
    A ← A - T[0] × I
    A ← A × Z^{-1}
endproc

Also note that we can use ShiftPoly* to make a new version of GenPoly* 
that generates by repeated shifting. 
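
The remark about generating by repeated shifting can be sketched in Python as follows. This is only an illustration in GF(2^4), under the assumption that the internal basis coincides with the polynomial basis 1, γ, γ², γ³ (γ a root of x⁴ + x + 1), so V_0 is the identity and T[0] is bit 0. The helper names and the value chosen for ζ are hypothetical.

```python
M = 4                 # extension degree m (assumed small example)
MODULUS = 0b10011     # x^4 + x + 1, irreducible over GF(2)

def gf_mul(a, b):
    """Multiply two elements of GF(2^4) held as bit vectors."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= MODULUS
    return result

def gf_inv(a):
    """Inverse of a nonzero element: a^(2^m - 2)."""
    result = 1
    for _ in range(2**M - 2):
        result = gf_mul(result, a)
    return result

G, I, Z = 0b0010, 0b0001, 0b0110   # gamma, identity, illustrative zeta
V0 = I                             # assumption: (A x V0)[0] is bit 0 of A
Z_INV = gf_inv(Z)
ZG_INV = gf_mul(Z, gf_inv(G))      # precomputed Z x G^{-1}

def shift_poly_star(A):
    """External shift of A in the polynomial* basis (ShiftPoly*)."""
    A = gf_mul(A, ZG_INV)
    T = gf_mul(A, V0)
    A ^= (T & 1) * I               # A <- A - T[0] x I (XOR in characteristic 2)
    return gf_mul(A, Z_INV)

def gen_by_shifting():
    """Generate eta_0, ..., eta_{m-1} by shifting eta_0 = zeta^{-1} repeatedly."""
    A = Z_INV
    for _ in range(M):
        yield A
        A = shift_poly_star(A)

dual = list(gen_by_shifting())
```

One more shift applied to the last element should give zero, matching $s(\xi_{m-1}) = 0$ in Theorem 2.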

5 Techniques Involving the Dual of a Normal Basis 

This section discusses the structure of a normal* basis, and presents an efficient 
basis generation function and an efficient external shift function. 

Theorem 3. Let $\gamma, \gamma^q, \ldots, \gamma^{q^{m-1}}$ be a normal basis for $GF(q^m)$, and let
$h(\epsilon)$ be a linear function from $GF(q^m)$ to $GF(q)$. Let $h_0(\epsilon)$ be the
$\gamma$-coefficient of the representation of the element $\epsilon$ in the normal basis. Let
$\zeta$ be the element of $GF(q^m)$ such that $h_0(\zeta\epsilon) = h(\epsilon)$. A formula for the dual
basis $\eta_0, \ldots, \eta_{m-1}$ of this normal basis with respect to $h$ is

$$\eta_i = \zeta^{-1}\xi_i,$$

where $\xi_0 = 1$, $\xi_i = \sigma\xi_{i-1}^q = \sigma^{(q^i-1)/(q-1)}$, and $\sigma$ is the element such that
$h_0(\gamma^{q^i}\sigma)$ is 1 for $i = 1$ and 0 for $i = 0$ and $2 \le i \le m - 1$.







Proof. First, we observe that $\zeta$ and $\sigma$ exist, the latter being an element of the
dual basis with respect to $h_0$. We also observe that $h_0(\sigma\epsilon^q) = h_0(\epsilon)$ for all
$\epsilon \in GF(q^m)$. To prove that the formula is correct, we use the definition of the dual
basis and induction. First, we consider $\eta_0$ and observe that
$h(\gamma^{q^i}\eta_0) = h_0(\gamma^{q^i})$, which is 1 if $i = 0$ and 0 if $1 \le i \le m - 1$, meeting
the definition. For the induction step, we get the following for the $j$th element,
where $j > 0$:

$$h(\gamma^{q^i}\eta_j) = h_0(\gamma^{q^i}\sigma\xi_{j-1}^q) = h_0(\gamma^{q^{i-1}}\xi_{j-1}).$$

For $i = 0$, the equation becomes $h_0(\gamma^{q^{m-1}}\xi_{j-1})$, which by induction is 0. (Note
that $\gamma = \gamma^{q^m}$.) For $1 \le i \le m - 1$, it becomes $h_0(\gamma^{q^{i-1}}\xi_{j-1})$, which by
induction is 1 if $i = j$ and 0 if $i \ne j$. In both cases the definition is met.

In the following algorithms, G will be the internal representation of $\gamma$, S will
be the internal representation of $\sigma$, and Z will be the internal representation of
$\zeta$. As before, we can assume Z is nonzero.

5.1 GenNormal* 

Now that we know the general formula for the dual of a normal basis, we can 
demonstrate a technique for efficiently generating the dual of a normal basis. 
Like GenPoly*, GenNormal* is written as an iterator. 

iter GenNormal*
    T ← Z^{-1}
    W ← S
    yield T
    for i from 1 to m - 1 do
        T ← T × W
        W ← W^q
        yield T
    endfor
enditer

Theorem 4. The iterator GenNormal* generates the elements of the normal* 
basis. 

Proof. After the first iteration, GenNormal* outputs the internal representation
of $\zeta^{-1}$. At each successive step, the basis element is multiplied by successively
higher $q$th powers of $\sigma$, so we get $\zeta^{-1}\sigma$, $\zeta^{-1}\sigma^{q+1}$, $\zeta^{-1}\sigma^{q^2+q+1}$, and so on.
By our formula, this is the correct list of normal* basis elements.
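
A small Python sketch may help make GenNormal* concrete. It works in GF(2^4) with q = 2 and assumes that the internal basis is the polynomial basis modulo x⁴ + x + 1 and that γ = x³ generates a normal basis (the code checks this by brute force). The coefficient table, the search for σ, and all helper names are hypothetical conveniences for a tiny field, not part of the text.

```python
from itertools import product

M = 4                 # extension degree m (assumed small example), q = 2
MODULUS = 0b10011     # x^4 + x + 1, irreducible over GF(2)

def gf_mul(a, b):
    """Multiply two elements of GF(2^4) held as bit vectors."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= MODULUS
    return result

def gf_pow(a, e):
    """Raise a to the e-th power by repeated multiplication."""
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result

GAMMA = 0b1000                                     # x^3 (assumed normal element)
NORMAL = [gf_pow(GAMMA, 2**i) for i in range(M)]   # gamma^(q^i)

# Brute-force table: element -> coefficient tuple in the normal basis.
COORDS = {}
for coeffs in product((0, 1), repeat=M):
    elem = 0
    for c, b in zip(coeffs, NORMAL):
        if c:
            elem ^= b
    COORDS[elem] = coeffs
assert len(COORDS) == 2**M     # the conjugates of gamma are linearly independent

def h0(x):
    """gamma-coefficient of x in the normal basis."""
    return COORDS[x][0]

# sigma: h0(gamma^(q^i) * sigma) is 1 for i = 1 and 0 otherwise (found by search).
S = next(s for s in range(1, 2**M)
         if [h0(gf_mul(b, s)) for b in NORMAL] == [0, 1, 0, 0])
Z = 0b0110                     # arbitrary nonzero zeta (illustrative choice)

def gen_normal_star():
    """Yield the internal representations of eta_0, ..., eta_{m-1}."""
    T = gf_pow(Z, 2**M - 2)    # Z^{-1}
    W = S
    yield T
    for _ in range(1, M):
        T = gf_mul(T, W)
        W = gf_mul(W, W)       # W <- W^q with q = 2
        yield T

dual = list(gen_normal_star())
```

With h(ε) = h_0(ζε), the generated elements should satisfy h(γ^{q^i} η_j) = 1 exactly when i = j.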

5.2 ShiftNormal*

There is also an efficient method for doing a rotation of an element in the normal* 
basis. The algorithm simply uses the recursive formula for generating $\xi_i$ from
$\xi_{i-1}$, namely

$$s(\epsilon) = \sigma\epsilon^q.$$

Theorem 5. s performs an external shift (actually, a rotation) in the normal*
basis with respect to $h_0$, such that $s(\xi_i) = \xi_{i+1}$ for $0 \le i \le m - 2$ and
$s(\xi_{m-1}) = \xi_0$.

Proof. As before, we first observe that s is linear. We again only have to show
that the formula is correct for $\xi_{m-1}$. To see this, define $\xi_m$ as $s(\xi_{m-1})$ and
apply the equation from the proof above:

$$h_0(\gamma^{q^i}\xi_m) = h_0(\gamma^{q^i}\sigma\xi_{m-1}^q) = h_0(\gamma^{q^{i-1}}\xi_{m-1}).$$

For $i = 0$, the equation becomes $h_0(\gamma^{q^{m-1}}\xi_{m-1})$, which is 1. For $1 \le i \le m - 1$,
it becomes $h_0(\gamma^{q^{i-1}}\xi_{m-1})$, which is 0. Since the values $h_0(\gamma^{q^i}\xi_m)$ as i varies
correspond to coefficients of the representation of $\xi_m$ in the basis $\xi_0, \ldots, \xi_{m-1}$
and they are all zero except for the $\xi_0$-coefficient, which is 1, it follows that
$\xi_m = \xi_0$.

Shifting in the dual basis $\eta_0, \ldots, \eta_{m-1}$ is accomplished by computing the
function $\zeta^{-1}s(\zeta\epsilon)$, as before. Based on this, we have the algorithm ShiftNormal*.

proc ShiftNormal* (A)
    A ← A^q × SZ^{q-1}
endproc

Note that this only requires storage for one value, as $SZ^{q-1}$ can be precomputed.
Also note that we can use ShiftNormal* to make a new version of GenNormal*.
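
The following Python sketch illustrates ShiftNormal* and the idea of regenerating the normal* basis by repeated shifting. As before it is only a toy in GF(2^4) with q = 2, assuming a polynomial internal basis modulo x⁴ + x + 1 and γ = x³ as the normal element. The brute-force coefficient table, the search for σ, and the helper names are hypothetical.

```python
from itertools import product

M = 4                 # extension degree m (assumed small example), q = 2
MODULUS = 0b10011     # x^4 + x + 1, irreducible over GF(2)

def gf_mul(a, b):
    """Multiply two elements of GF(2^4) held as bit vectors."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= MODULUS
    return result

def gf_pow(a, e):
    """Raise a to the e-th power by repeated multiplication."""
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result

GAMMA = 0b1000                                     # x^3 (assumed normal element)
NORMAL = [gf_pow(GAMMA, 2**i) for i in range(M)]   # gamma^(q^i)
COORDS = {}                                        # element -> normal-basis coefficients
for coeffs in product((0, 1), repeat=M):
    elem = 0
    for c, b in zip(coeffs, NORMAL):
        elem ^= b if c else 0
    COORDS[elem] = coeffs
assert len(COORDS) == 2**M                         # gamma really is a normal element

def h0(x):
    """gamma-coefficient of x in the normal basis."""
    return COORDS[x][0]

# sigma is located by exhaustive search, which is only feasible in a toy field.
S = next(s for s in range(1, 2**M)
         if [h0(gf_mul(b, s)) for b in NORMAL] == [0, 1, 0, 0])
Z = 0b0110                                         # illustrative nonzero zeta
SZ = gf_mul(S, Z)                                  # S x Z^(q-1) with q = 2, precomputed

def shift_normal_star(A):
    """Rotate A in the normal* basis: A <- A^q x S Z^(q-1)."""
    return gf_mul(gf_mul(A, A), SZ)

eta = [gf_pow(Z, 2**M - 2)]                        # eta_0 = zeta^{-1}
for _ in range(M - 1):
    eta.append(shift_normal_star(eta[-1]))
```

Because the shift is a rotation, applying it once more to the last element should return the first one.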

6 Conclusion 

We have demonstrated efficient algorithms for external shifting and efficient
basis generation in the polynomial* and normal* bases. Using these algorithms
in the storage-efficient basis conversion methods described above, we can
implement the following basis conversion methods: ImportByShiftInsert,
ImportByGenAccum, and ExportByShiftExtract for a polynomial* or normal*
external basis, and ExportByGen*Eval for a polynomial or normal basis.

Abstract. Three new types of power analysis attacks against smartcard implementations
of modular exponentiation algorithms are described. The first attack
requires an adversary to exponentiate many random messages with a known and 
a secret exponent. The second attack assumes that the adversary can make the 
smartcard exponentiate using exponents of his own choosing. The last attack 
assumes the adversary knows the modulus and the exponentiation algorithm 
being used in the hardware. Experiments show that these attacks are successful. 
Potential countermeasures are suggested. 



1 Introduction 

Cryptographers have been very successful at designing algorithms that defy traditional 
mathematical attacks, but sometimes, when these algorithms are actually implemented, 
problems can occur. The implementation of a cryptographic algorithm can have weak- 
nesses that were unanticipated by the designers of the algorithm. Adversaries can 
exploit these weaknesses to circumvent the security of the underlying cryptographic 
algorithm. Attacks on the implementations of cryptographic systems are a great concern 
to operators and users of secure systems. Implementation attacks include power analy- 
sis attacks [1,2], timing attacks [3,4], fault insertion attacks [5,6], and electromagnetic 
emission attacks [7]. Kelsey et al. [8] review some of these attacks and refer to them as
“side-channel” attacks. The term “side-channel” is used to describe the leakage of
unintended information from a supposedly tamper-resistant device, such as a smartcard.

In a power analysis attack the side-channel is the device’s power consumption. An 
adversary can monitor the power consumption of a vulnerable device, such as a smart- 
card, to defeat the tamper-resistance properties and learn the secrets contained inside 
the device [1]. Although it is preferable to design secure systems that do not rely on 
secrets contained in the smartcard, there are applications where this may not be possible 
or is undesirable. In these systems, if the secret, for instance a private key, is compro- 
mised, then the entire system’s security may be broken. 

In this paper we examine the vulnerabilities of public-key cryptographic algorithms
to power analysis attacks. Specifically, attacks on the modular exponentiation process 
are described. These attacks are aimed at extracting the secret exponent from tamper- 
resistant hardware by observing the instantaneous power consumption signals into the 
device while the exponent is being used for the exponentiation. Experimental results on 
a smartcard containing a modular exponentiation circuit are provided to confirm the 
threats posed by these attacks. 

Three types of attacks are described that can be mounted by adversaries possessing 
various degrees of capabilities and sophistication. The first attack requires that, in addi- 
tion to exponentiating with the secret exponent, the smartcard will also exponentiate 
with at least one exponent known to the attacker. This attack, referred to as a “Single- 
Exponent, Multiple-Data” (SEMD) attack, requires the attacker to exponentiate many 
random messages with both the known and the secret exponent. The SEMD attack is 
demonstrated to be successful on exponentiations using a small modulus (i.e., 64 bits) 
with 20,000 trial exponentiations, but with a large modulus might require 20,000 expo- 
nentiations per exponent bit. The second attack we introduce requires that the attacker 
can get the smartcard to exponentiate using exponents of his own choosing. Our exper- 
iments showed that this attack, referred to as a “Multiple-Exponent, Single-Data” 
(MESD) attack, requires the attacker to run about 200 trial exponentiations for each 
exponent bit of the secret exponent. The last attack that we discovered does not require 
the adversary to know any exponents, but does assume the attacker can obtain basic 
knowledge of the exponentiation algorithm being used by the smartcard. With this 
attack, referred to as a “Zero-Exponent, Multiple-Data” (ZEMD) attack, we can suc- 
cessfully extract a secret exponent with about 200 trial exponentiations for each secret 
exponent bit. 

The organization of this paper is as follows. First, the related work is reviewed and
the motivation for research into power analysis attacks is given. Next, implementations 
of modular exponentiation and the basic principles of power analysis attacks are 
reviewed. The equipment and software needed for these attacks is described and 
detailed descriptions of the MESD, SEMD and ZEMD attacks are given. Finally, poten- 
tial countermeasures are suggested. 

1.1 Related Work 

Previous papers that describe power analysis attacks mainly examine the security of 
symmetric key cryptographic algorithms. Kocher et al. [1] review a Simple Power
Analysis (SPA) attack and introduce a Differential Power Analysis (DPA) attack,
which uses powerful statistical techniques. They describe specific attacks against
the Data Encryption Standard (DES) [9], and their techniques can also be modified for
other ciphers. Kelsey et al. [8] show how even a small amount of side-channel
information can be used to break a cryptosystem such as the DES. An alternate approach is
taken in [2], where techniques to strengthen the power consumption attack by
maximizing the side-channel information are described. The Advanced Encryption Standard
(AES) candidate algorithms are analyzed in [10-12] for their vulnerabilities to power
analysis attacks. These papers advise that the vulnerabilities of the AES algorithms to
power analysis attacks should be considered when choosing the next encryption
standard.

1.2 Research Motivation 

Tamper-resistant devices, such as smartcards, can be used to store secret data such as a 
person’s private key in a two-key, public-key cryptosystem. Familiar examples of such
systems are an RSA cryptosystem [13] and an elliptic-curve cryptosystem [14,15]. In a
typical scenario, the owner of a smartcard needs to present the card in order to make a 
payment, log onto a computer account, or gain access to a secured facility. In order to 
complete a transaction, the smartcard is tested for authenticity by a hardware device 
called a reader. The reader is provided by a merchant or some other third party and, in 
general, may or may not be trusted by the smartcard owner. Thus, it is important that 
when a user relinquishes control of her smartcard, she is confident that the secrecy of 
the private key in the card can be maintained. In an RSA cryptosystem, the authenticity 
of the card is tested by asking the card to use its internally stored private key to modu- 
larly exponentiate a random challenge. Since it is possible that the card is being 
accessed by a malicious reader, the power consumption of the card during the exponen- 
tiation process should not reveal the secret key. 



2 Review of Modular Exponentiation Implementations 

Modular exponentiation is at the root of many two-key, public-key cryptographic
implementations. The technique used to implement modular exponentiation is com- 
monly known as the “square-and-multiply” algorithm. Elliptic curve cryptosystems use 
an analogous routine called the “double-and-add” algorithm. Two versions of the 
square-and-multiply algorithm are given in Fig. 1. The first routine in Fig. 1, exp1,
starts at the exponent’s most significant nonzero bit and works downward. The second
routine, exp2, starts at the least significant bit of the exponent e and works upward. Both
routines are vulnerable to attack and both return the same result, $M^e \bmod N$. Common
techniques to implement modular exponentiation (i.e., particular implementations of
the modular square and modular multiply operations) can be found in [16-21]. One
popular method to speed up the exponentiation is to use Montgomery’s modular
multiplication algorithm [22] for all the multiplies and squares.

exp1(M, e, N)
{ R = M
  for (i = n-2 down to 0)
  { R = R^2 mod N
    if (ith bit of e is a 1)
      R = R·M mod N }
  return R }

exp2(M, e, N)
{ R = 1
  S = M
  for (i = 0 to n-1)
  { if (ith bit of e is a 1)
      R = R·S mod N
    S = S^2 mod N }
  return R }

Fig. 1. Exponentiation Routines Using the Square-and-Multiply Algorithm

Two versions of the square-and-multiply algorithm used for smartcard authentication are given
above. The routine exp1 starts at the most significant bit and works down and the routine exp2
does the opposite. The routine exp2 requires extra memory to store the S variable. The exponent,
e, has n bits, where the least significant bit is numbered 0 and the most significant nonzero bit is
numbered n-1.
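
The two routines of Fig. 1 translate almost directly into Python. The sketch below is an illustration rather than the smartcard implementation: it uses ordinary integer arithmetic in place of Montgomery multiplication, takes n to be the bit length of e, and assumes e is at least 1.

```python
def exp1(M, e, N):
    """Left-to-right square-and-multiply: scan e from bit n-2 down to bit 0."""
    n = e.bit_length()          # bit n-1 is the most significant nonzero bit
    R = M % N
    for i in range(n - 2, -1, -1):
        R = (R * R) % N         # square for every exponent bit
        if (e >> i) & 1:
            R = (R * M) % N     # multiply only when the bit is 1
    return R

def exp2(M, e, N):
    """Right-to-left square-and-multiply: scan e from bit 0 up to bit n-1."""
    R = 1
    S = M % N                   # S holds M^(2^i) mod N, the extra storage noted in Fig. 1
    for i in range(e.bit_length()):
        if (e >> i) & 1:
            R = (R * S) % N
        S = (S * S) % N
    return R
```

Both functions should agree with Python's built-in three-argument pow for any modulus greater than 1. The data-dependent `if` in each loop is the branch that the attacks in this paper target.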

The attacks in this paper are a potential threat to all of these implementations. The 
MESD and SEMD attacks are against the square-and-multiply method. Every imple- 
mentation executes the square-and-multiply method in some manner, so all are poten- 
tially vulnerable. The ZEMD attack works on intermediate data results and is only 
possible if an attacker possesses basic knowledge of the implementation. The attacker 
needs to know which of the types of square-and-multiply algorithms is being used and 
the technique used for the modular multiplications. Even if the attacker does not know 
the implementation, there is a fairly small number of likely possibilities. In our attack, 
all that was necessary to know was that the exponentiation was done using the exp1
algorithm and Montgomery’s method was used for the modular multiplies. 



3 Review of Power Analysis Attacks 

Power analysis attacks work by exploiting the differences in power consumption 
between when a tamper-resistant device processes a logical zero and when it processes 
a logical one. For example, when the secret data on a smartcard is accessed, the power 
consumption may be different depending on the Hamming weight of the data. If an 
attacker knows the Hamming weight of the secret key, the brute force search space is 
reduced and given enough Hamming weights of independent functions of the secret 
key, the attacker could potentially learn the entire secret key. This type of attack, where 
the adversary directly uses a power consumption signal to obtain information about the 
secret key is referred to as a Simple Power Analysis (SPA) attack and is described in [1].

Differential Power Analysis (DPA) is based on the same underlying principle as an
SPA attack, but uses statistical analysis techniques to extract very tiny differences in 
power consumption signals. DPA was first introduced in [1] and a strengthened version 
was reported in [2]. 

3.1 Simple Power Analysis (SPA) 

SPA [1] on a single-key cryptographic algorithm, such as DES, could be used to learn
the Hamming weight of the key bytes. DES uses only a 56-bit key, so learning the
Hamming weight information alone makes DES vulnerable to a brute-force attack. In fact,
depending on the implementation, there are even stronger SPA attacks. A two-key,
public-key cryptosystem, such as an RSA or elliptic curve cryptosystem, might also be
vulnerable to an SPA attack on the Hamming weight of the individual key bytes;
however, it is possible that an even stronger attack can be made directly against the
square-and-multiply algorithm.

If exponentiation were performed in software using one of the square-and-multiply 
algorithms of Fig. 1, there could be a number of potential vulnerabilities. The main 
problem with both algorithms is that the outcome of the “if statement” might be
observable in the power signal. This would directly enable the attacker to learn every bit of the
secret exponent. A simple fix is to always perform a multiply and to only save the result 
if the exponent bit is a one. This solution is very costly for performance and still may 
be vulnerable if the act of saving the result can be observed in the power signal. 
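
The "always multiply" fix can be sketched in Python as a variant of the left-to-right routine. This is an illustration of the idea only: the function name is hypothetical, e is assumed to be at least 1, and the Python conditional used to select the result is itself a data-dependent step, which corresponds to the residual leak mentioned above.

```python
def exp1_always_multiply(M, e, N):
    """Left-to-right exponentiation that multiplies on every bit (e >= 1)."""
    n = e.bit_length()
    R = M % N
    for i in range(n - 2, -1, -1):
        R = (R * R) % N
        product = (R * M) % N    # the multiply is performed regardless of the bit
        if (e >> i) & 1:
            R = product          # the result is saved only when the bit is 1
    return R
```

The routine performs n - 1 squares and n - 1 multiplies for every n-bit exponent, which is where the performance cost comes from.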

3.2 Differential Power Analysis (DPA)

The problem with an SPA attack is that the information about the secret key is difficult 
to directly observe. In our experiments, the information about the key was often 
obscured with noise and modulated by the device’s clock signal. DPA can be used to 
reduce the noise and also to “demodulate” the data. Multiple-bit DPA [2] can be used 
to attack the DES algorithm by defining a function, say D, based on the guessed key bits 
entering an S-box lookup table. If the D-function predicts high power consumption for 
a particular S-box lookup, the power signal is placed into set $S_{high}$. If low power
consumption is predicted, then the signal is placed into an alternate set, $S_{low}$. If the
predicted power consumption is neither high nor low, then the power signal is discarded.
The result of this partitioning is that when the average signal in set $S_{low}$ is subtracted
from the average signal in set $S_{high}$, the resulting signal is demodulated. Any power
biases at the time corresponding to the S-box lookup operation are visible as an obvious
spike in the difference signal and much of the noise is eliminated because averaging
reduces the noise variance. Correct guesses of the secret key bits into an S-box are
verified by trying all $2^6$ possibilities and checking which one produces the strongest
difference signal.
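
The partition, average, and subtract steps can be illustrated with a toy simulation in Python. Everything about the trace model is an assumption made for the sketch: traces are Gaussian noise with a Hamming-weight bias at one sample, the predicted value is a random 4-bit number standing in for an S-box output, and only a single (correct) key guess is evaluated.

```python
import random

SAMPLES = 20      # samples per simulated power trace (assumed)
LEAK_AT = 7       # sample index where the predicted value is processed (assumed)

def hamming_weight(x):
    return bin(x).count("1")

def simulate_trace(value, rng):
    """Toy trace: unit Gaussian noise plus a Hamming-weight bias at LEAK_AT."""
    trace = [rng.gauss(0.0, 1.0) for _ in range(SAMPLES)]
    trace[LEAK_AT] += hamming_weight(value)
    return trace

def mean_trace(traces):
    """Sample-by-sample average of a list of equal-length traces."""
    return [sum(column) / len(traces) for column in zip(*traces)]

def dpa_difference(values, traces):
    """Partition traces by predicted power, then subtract the set averages."""
    high = [t for v, t in zip(values, traces) if hamming_weight(v) >= 3]
    low = [t for v, t in zip(values, traces) if hamming_weight(v) <= 1]
    # Traces whose prediction is neither high nor low (weight 2) are discarded.
    avg_high, avg_low = mean_trace(high), mean_trace(low)
    return [h - l for h, l in zip(avg_high, avg_low)]

rng = random.Random(1)
values = [rng.randrange(16) for _ in range(800)]    # predicted 4-bit values
traces = [simulate_trace(v, rng) for v in values]
diff = dpa_difference(values, traces)
spike = max(range(SAMPLES), key=lambda j: abs(diff[j]))
```

In this model the difference signal should be close to zero everywhere except at LEAK_AT. A full attack would repeat the partitioning for each candidate key value and keep the guess with the strongest spike.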

All of the attacks described in this paper use averaging and subtracting and so are 
similar to a DPA attack. The averaging reduces the noise and the subtracting demodu- 
lates the secret information and enhances the power biases. 



4 Power Analysis Equipment 

A smartcard with a built-in modular exponentiation circuit was used to evaluate the 
attacks described in this paper. The exponentiation circuit on this smartcard is a typical 
implementation of the square-and-multiply algorithm using a Montgomery multiplica- 
tion circuit to speed up the modular reductions. The exponentiation circuit was accessed 
via a software program residing in the card’s memory. This software executed a simple 
ISO7816 smartcard protocol [23] which supports a command similar to the standard
“internal authenticate” command. 



5 Attacking a Secret Exponent 

The objective of the attacks described in this paper is to find the value of e, the secret 
exponent stored in the smartcard’s internal memory. The attacker is assumed to have 
complete control of the smartcard. He can ask the card to exponentiate using e and can 
monitor all input and output signals. The card will obey all commands of the attacker, 
except a command to output the secret key. The main command that is needed is the 
“internal authenticate” command which causes the card to receive an input value, M, 
and output $M^e \bmod N$. Some smartcard systems require the user to enter a Personal
Identification Number (PIN) prior to allowing access to the card. This feature is not
considered in our attacks. Also, the number of times the attacker can query the card is assumed
unlimited. All of these assumptions are reasonable since smartcard systems have been
implemented that allow such access. Other assumptions used for particular attacks are
stated in the sections that describe the specific attack details.

5.1 A Simple Correlation Experiment 

We performed a correlation experiment to determine if e could be revealed by simply
cross-correlating the power signal from a single multiply operation with the entire
exponentiation’s power signal. This attack was designed to see how easy it is to distinguish
the multiplies from the squares, thus revealing the bits of e. Let the multiply’s power
signal be $S_m[j]$ and the exponentiation’s power signal be $S_e[j]$. The cross-correlation
signal, $S_c[j]$, is calculated as

$$S_c[j] = \sum_{\tau=0}^{W} S_m[\tau]\,S_e[j+\tau],$$

where W is the number of samples in the multiply’s power signal. That is, $W = T_m/T$,
where $T_m$ is the time needed for a multiply operation and T is the sampling period. An
attacker can learn the approximate value of W through experimentation or from the
smartcard’s documentation.
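
A direct Python rendering of the cross-correlation may be useful. The sketch treats both signals as plain lists and sums over the W samples of the multiply template (indices 0 to W - 1); the short example signals are invented for illustration.

```python
def cross_correlate(S_m, S_e):
    """Slide the multiply template S_m along the exponentiation signal S_e."""
    W = len(S_m)                               # samples in the multiply's signal
    return [sum(S_m[t] * S_e[j + t] for t in range(W))
            for j in range(len(S_e) - W + 1)]

S_m = [1, 2, 1]                                # toy template
S_e = [0, 0, 1, 2, 1, 0, 0, 1, 2, 1, 0]        # toy signal containing the template twice
S_c = cross_correlate(S_m, S_e)
```

Here S_c peaks at offsets 2 and 7, the two places where the template occurs, which mirrors how peaks in the measured cross-correlation mark the start of each operation.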

The power signals and cross-correlation signal obtained from running this
experiment are shown in Fig. 2. The exponentiation and multiply power signals were obtained
by running the smartcard with constant input data and averaging 5,000 power signals to
reduce the measurement noise. This experiment was first tested on a known exponent,
so the locations of the squares and multiplies are known and are labeled in Fig. 2.
The resulting cross-correlation signal shows peaks at the locations of the individual
squares and multiplies, but the heights of the peaks are uncorrelated with the type of
operation. Thus, this cross-correlation technique is not useful to differentiate between
squares and multiplies. However, it is interesting to point out that the time needed for
each operation in the square-and-multiply algorithm can be determined from the
cross-correlation signal. This information could lead to a combined power analysis and timing
attack in implementations where the time to multiply is slightly different than the time
to square. Such an attack would be more powerful than previously documented timing
attacks because the cross-correlation signal would yield the timing of all intermediate
operations. Fortunately, the smartcards we examined do not have this problem.

[Figure: three stacked traces labeled Exponentiation Power Signal, Multiplication Power
Signal, and Cross-Correlation Signal]

Fig. 2. Cross-Correlation of Multiplication and Exponentiation Power Signals

The above signals were obtained using the power analysis equipment described in Section 4.
The signals were averaged for 5,000 exponentiations using a constant input value. The results
show an ability to determine the time between the square-and-multiply operations, but cannot
be used to distinguish multiply operations from squaring operations.

5.2 Single-Exponent, Multiple-Data (SEMD) Attack 

The SEMD attack assumes that the smartcard is willing to exponentiate an arbitrary
number of random values with two exponents: the secret exponent and a public
exponent. Such a situation could occur in a smartcard system that supports the ISO7816 [23]
standard “external authenticate” command. Whereas the “internal authenticate”
command causes the smartcard to use its secret key, the “external authenticate” command
can be used to make the smartcard use the public key associated with a particular
smartcard reader. It is assumed that the exponent bits of this public key would be known to
the attacker.

The basic premise of this attack is that by comparing the power signal of an
exponentiation using a known exponent to a power signal using an unknown exponent, the
adversary can learn where the two exponents differ, and thus learn the secret exponent. In
reality, the comparison is nontrivial because the intermediate data results of the
square-and-multiply algorithm cause widely varying changes in the power signals, thereby
making direct comparisons unreliable. The solution to this problem is to use averaging
and subtraction. This simple DPA technique begins by using the secret exponent to
exponentiate L random values and collecting their associated power signals, $S_i[j]$.
Likewise, L power signals, $P_i[j]$, are collected using the known exponent. The average
signals are then calculated and subtracted to form $D[j]$, the DPA bias signal,

$$D[j] = \frac{1}{L}\sum_{i=1}^{L} S_i[j] - \frac{1}{L}\sum_{i=1}^{L} P_i[j] = \bar{S}[j] - \bar{P}[j].$$

The portions of the signals $\bar{S}[j]$ and $\bar{P}[j]$ that are dependent on the intermediate
data will average out to the same constant mean $\mu$, thus:

$$\bar{S}[j] \approx \bar{P}[j] \approx \mu \quad \text{if } j = \text{a data dependent sample point}.$$

The portions of the signals $\bar{S}[j]$ and $\bar{P}[j]$ that are dependent on the exponent bits
will average out to different values, $\mu_s$ or $\mu_m$, depending on whether a square or
multiply operation is performed. Thus, if $\mu_s$ and $\mu_m$ are not equal, then their difference
will be nonzero and the DPA bias signal, $D[j]$, can be used to determine the exact location
of the squares and multiplies in the secret exponent:

$$D[j] \approx \begin{cases} 0 & \text{if } j = \text{data dependent point or exponentiation operations agree} \\ \text{nonzero} & \text{if } j = \text{point where the exponentiation operations differ} \end{cases}$$
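
The averaging and subtraction can be tried out on synthetic data. In the Python sketch below, every modelling choice is an assumption made for illustration: each operation occupies K samples, squares and multiplies have mean power levels of 1.0 and 1.5, the data-dependent variation is unit Gaussian noise, and the two short exponents are arbitrary.

```python
import random

K = 5                        # samples per operation (assumed)
MU = {"S": 1.0, "M": 1.5}    # assumed mean power of a square / multiply

def operations(e):
    """Operation sequence of the left-to-right square-and-multiply routine."""
    ops = []
    for i in range(e.bit_length() - 2, -1, -1):
        ops.append("S")
        if (e >> i) & 1:
            ops.append("M")
    return ops

def simulate_trace(ops, rng):
    """Toy trace: operation mean plus data-dependent Gaussian variation."""
    return [MU[op] + rng.gauss(0.0, 1.0) for op in ops for _ in range(K)]

def average_trace(ops, L, rng):
    """Average of L simulated traces, one per random input value."""
    total = [0.0] * (len(ops) * K)
    for _ in range(L):
        for j, sample in enumerate(simulate_trace(ops, rng)):
            total[j] += sample
    return [t / L for t in total]

def interval_energy(D):
    """Energy of the bias signal in each operation-sized interval."""
    return [sum(d * d for d in D[k:k + K]) for k in range(0, len(D), K)]

rng = random.Random(7)
secret_ops = operations(0b110101)    # S M S S M S S M
known_ops = operations(0b101101)     # S S M S M S S M
L = 2000
S_bar = average_trace(secret_ops, L, rng)
P_bar = average_trace(known_ops, L, rng)
D = [s - p for s, p in zip(S_bar, P_bar)]
differing = [k for k, en in enumerate(interval_energy(D)) if en > 0.3]
```

With these settings the intervals flagged in `differing` are exactly those where one exponent performs a square while the other performs a multiply. The energy threshold of 0.3 is tuned to this toy model and has no meaning for real measurements.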

The SEMD attack was performed on a smartcard and the result is shown in Fig. 3.
For this experiment the exponentiation was simplified by using a modulus and data with
only 64 bits. This simplification was done only for illustrative purposes. Using smaller
data made it possible to store more of the exponentiation’s power signal in the digital
oscilloscope, so many more exponent bits could be attacked with each test. In an actual
attack against full-sized data, one would likely be able to attack only a small portion of
the exponentiation at a time. The number of bits attacked at one time depends on the
size of the memory in the attacker’s digital oscilloscope. The DPA signal in Fig. 3
shows an attack on about 16 exponent bits and was obtained with L = 10,000; thus 20,000
trial exponentiations were needed. In a real attack, using full-sized data, this attack
might need 20,000 exponentiations for each exponent bit. In this case a sliding window
approach is needed, where only a windowed portion of the power trace is attacked at
one time.

Fig. 3. Single-Exponent, Multiple-Data (SEMD) Attack Results

The above plot is the DPA signal comparing the exponentiation power signal produced with a
known exponent and an unknown exponent. The energy in the DPA signal is greater when the
two exponent operations are different. The shaded horizontal bars show the output of an
integrate-and-dump filter indicating the energy associated with each interval of time. The
above signal was obtained using 20,000 trial exponentiations.

The DPA bias signal in Fig. 3 is labeled to show the squares (S) and multiplies (M) 
associated with the secret and known exponents. The regions where these operations 
differ exhibit a corresponding increase in the amplitude of the DPA bias signal. An inte- 
grate-and-dump filter was used to compute the signal energy associated with each 
region and the output of the filter is graphed as shaded horizontal line segments in 
Fig. 3. The filter sums the energy of the bias signal over each operation’s time
interval; a limiting parameter in the filter is chosen to eliminate the overwhelming
influence of spurious spikes in the bias signal. The final result shows that the output
of the integrate-and-dump filter is good at indicating where the secret and known
exponent operations differ. Thus, the SEMD attack is an important attack that
implementors of smartcard systems need to consider when designing a secure system.

5.3 Multiple-Exponent, Single-Data (MESD) Attack 

The MESD attack is more powerful than the SEMD attack, but requires a few more 
assumptions about the smartcard. The previously described SEMD attack is a very 
simple attack requiring little sophistication on the part of the adversary, but the resulting 
DPA bias signal is sometimes difficult to interpret. The Signal-to-Noise Ratio (SNR) 
can be improved using the MESD attack. The assumption for the MESD attack is that 
the smartcard will exponentiate a constant value¹ using exponents chosen by the
attacker. Again, such an assumption is not unreasonable since some situations might
allow the smartcard to accept new exponents that can be supplied by an untrusted entity. 
Also, the smartcard does not have unlimited memory, so it is impossible for it to keep 
a history of previous values it has exponentiated. Thus, the card cannot know if it is 
being repeatedly asked to exponentiate a constant value. 

The algorithm for the MESD attack is given in Fig. 4. The first steps of the algo- 
rithm are to choose an arbitrary value, M, exponentiate M using the secret exponent e, 
and then collect the corresponding average power signal S^\j]. Next, the algorithm 
progresses by successively attacking each secret exponent bit starting with the first bit 
used in the square-and-multiply algorithm and moving towards the last. To attack the 
/th secret exponent bit, the adversary guesses the /th bit is a 0 and then a 1 and asks the 
card to exponentiate using both guesses. It is assumed that the adversary already knows 
the first through (;-l)st exponent bits so the intermediate results of the exponentiation 
up to the (/-l)st exponent bit will be the same for the guessed exponent and the secret 
exponent. If the adversary guesses the /th bit correctly, then the intermediate results will 
also agree at the /th position. If the guess is wrong, then the results will differ. This dif- 
ference can be seen in the corresponding power traces. Let be the current guess for 
the exponent. The average power signals for exponentiating Musing an with the /th 

M = arbitrary value and e_g = 0

Collect S_M[j]

for (i = n-1 to 0)

{ guess (ith bit of e_g is a 1) and collect S_1[j]
  guess (ith bit of e_g is a 0) and collect S_0[j]

  Calculate two DPA bias signals:

  D_1[j] = S_M[j] - S_1[j] and D_0[j] = S_M[j] - S_0[j]

  Decide which guess was correct using DPA result
  update e_g }

e_g is now equal to e (the secret exponent)

Fig. 4. Algorithm for the Multiple-Exponent, Single-Data (MESD) Attack

This algorithm gradually makes e_g equal to the secret exponent, bit by bit, by using the DPA
signal to decide which guess is correct at the ith iteration.



¹ This value may or may not be known to the attacker.

bit equal to 1 is $S_1[j]$, and using an $e_g$ with the ith bit equal to 0 is $S_0[j]$. Two DPA
bias signals can be calculated:

$D_1[j] = S_M[j] - S_1[j]$ and $D_0[j] = S_M[j] - S_0[j]$

Whichever exponent bit was correct produces a power signal that agrees with the secret 
exponent’s power signal for the larger amount of time. Thus, whichever bias signal is 
zero for a longer time corresponds to the correct guess. 

The resulting bias signals for a correct and an incorrect guess are shown in Fig. 5. It is
clear that the SNR in Fig. 5 is much improved over the SEMD attack. The higher SNR
of the MESD attack means that fewer trial exponentiations are needed for a successful
attack. Also, an experienced attacker really needs to calculate only one DPA bias signal.
For example, the attacker could always guess the exponent bit is a 1. If the guess is correct,
the DPA bias signal will remain zero for the duration of a multiply and a square
operation. If the guess is wrong, then the bias signal will only remain zero for the dura- 
tion of the square operation. This technique effectively cuts the running time of the 
algorithm of Fig. 4 in half. Our experiments showed that as few as 100 exponentiations 
were needed per exponent bit. Memory limitations in a digital oscilloscope also might 
require a moving window approach to collect the secret exponent’s power signal, thus 
resulting in 200 exponentiations per exponent bit. The circumstances allowing an 
MESD attack definitely need to be addressed by implementors concerned with power 
analysis attacks. 
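
The decision rule of the MESD attack can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the experimental setup described above: the "power signal" is a noise-free toy model (the Hamming weight of each byte of every intermediate result), the modulus, the constant M and the secret exponent are arbitrary values chosen for the example, and all function names are hypothetical. Because the model is noise-free, a single trace stands in for the averaged signal that a real attack would need.

```python
def hw_bytes(x, width):
    """Hamming weight of each byte of x: a toy stand-in for power samples."""
    return [bin(b).count("1") for b in x.to_bytes(width, "little")]


def power_trace(m, e, n_bits, modulus, width):
    """Noise-free toy trace of a left-to-right square-and-multiply."""
    r, samples = 1, []
    for i in range(n_bits - 1, -1, -1):
        r = (r * r) % modulus              # square (always executed)
        samples += hw_bytes(r, width)
        if (e >> i) & 1:
            r = (r * m) % modulus          # multiply (only for 1 bits)
            samples += hw_bytes(r, width)
    return samples


def zero_run(s_m, s_g):
    """Length of the leading stretch where the bias signal s_m - s_g is zero."""
    run = 0
    for a, b in zip(s_m, s_g):
        if a != b:
            break
        run += 1
    return run


def mesd_attack(s_m, card, n_bits):
    """Recover the exponent bit by bit, first bit used first (as in Fig. 4)."""
    e_g = 0
    for i in range(n_bits - 1, -1, -1):
        s_1 = card(e_g | (1 << i))         # guess: ith bit is a 1
        s_0 = card(e_g)                    # guess: ith bit is a 0
        if zero_run(s_m, s_1) > zero_run(s_m, s_0):
            e_g |= 1 << i                  # the "1" guess agreed for longer
    return e_g


# Illustrative parameters (assumptions, not taken from the experiments).
MODULUS = (2**61 - 1) * (2**31 - 1)
WIDTH = 12                                 # bytes needed to hold a residue
N_BITS = 16
M = 0x0123456789ABCDEF                     # the constant value being exponentiated
SECRET = 0xB6E5                            # what the attacker wants to learn


def card(e):
    """The card exponentiating the constant M with an attacker-chosen exponent."""
    return power_trace(M, e, N_BITS, MODULUS, WIDTH)


s_m = power_trace(M, SECRET, N_BITS, MODULUS, WIDTH)
recovered = mesd_attack(s_m, card, N_BITS)
print(hex(recovered), hex(SECRET))
```

With noisy measurements, `zero_run` would have to be replaced by a comparison of averaged signals against a noise threshold, which is where the number of trial exponentiations per bit comes from.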

5.4 Zero-Exponent, Multiple-Data (ZEMD) Attack 

The ZEMD attack is similar to the MESD attack, but has a different set of assumptions. 
One assumption for the ZEMD attack is that the smartcard will exponentiate many 
random messages using the secret exponent. This attack does not require the adversary 
know any exponents, hence the zero-exponent nomenclature. Instead, the adversary 
needs to be able to predict the intermediate results of the square-and-multiply algorithm 
using an off-line simulation. This usually requires that the adversary know the algo- 
rithm being used by the exponentiation hardware and the modulus used for the expo- 
nentiation. There are only a few common approaches to implementing modular 
exponentiation algorithms, so it is likely an adversary can determine this information. 
It is also likely that the adversary can learn the modulus because this information is usu- 
ally public. 



[Figure: two DPA bias signals, labeled "Incorrect Guess" and "Correct Guess"]

Fig. 5. Multiple-Exponent, Single-Data (MESD) Attack Results

The above plot is the DPA signal obtained when the next bit is guessed correctly compared to
when the next bit guess is wrong. The correct guess is clearly seen to be the signal that remains
zero the longest. This signal was obtained using 1,000 trial exponentiations.

The algorithm for the ZEMD attack is given in Fig. 6. The ZEMD attack starts by
attacking the first bit used during the exponentiation and proceeds by attacking each
successive bit. In this algorithm, the variable $e_g$ gradually becomes equal to the secret
exponent. After each iteration of the attack, another exponent bit is learned and $e_g$ is
subsequently updated. At the ith iteration of the algorithm, it is assumed that the guessed
exponent, $e_g$, is correct up to the (i-1)st bit. The algorithm then guesses that the ith bit
of the secret exponent is a 1 and a DPA bias signal is created to verify the guess. The
DPA bias signal is created by choosing a random input, M, and running a simulation to
determine the power consumption after the multiply in the ith step of the square-and-multiply
algorithm. This simulation is possible because the exponent is known up to
the ith bit and the power consumption can be estimated using the Hamming weight of
a particular byte in the multiplication result. Previous power analysis experiments
showed that a higher number of ones corresponds to higher power consumption.

If the multiply at the ith step actually occurred, then the power analysis signals can
be accurately partitioned into two sets, thereby creating biases (or spikes) in the DPA 
bias signal when the average signals in each partition are subtracted. If the guess is 
incorrect, then the partitioning will not be accurate and the power biases will not occur. 
A natural error-correcting feature of this algorithm is that if there is ever a mistake, all 
subsequent steps will fail to show any power biases. 

The ZEMD attack was implemented, and an example of the DPA bias signals for a
correct and an incorrect guess is given in Fig. 7. The DPA bias signals in Fig. 7 were
generated using an 8-bit partitioning function based on the Hamming weight of the
multiplication result. Power signals corresponding to results with Hamming weight eight
were subtracted from power signals corresponding to results with Hamming weight
zero. This partitioning technique creates a larger SNR and is further described in [2].

e_g = 0

for (i = n-1 to 0)

{ guess (ith bit of e_g is a 1)
  for (k = 1 to L)

  { choose a random value: M

    Simulate to the ith step the calculation of M^(e_g) mod N
    if (multiplication result has high Hamming weight)
       run smartcard and collect power signal: S[j]
       add S[j] to the high-weight set

    if (multiplication result has low Hamming weight)
       run smartcard and collect power signal: S[j]
       add S[j] to the low-weight set }

  Average the power signals in each set and subtract to get the DPA bias signal

  if DPA bias signal has spikes
     the guess was correct: make ith bit of e_g equal to 1
  else
     the guess was wrong: make ith bit of e_g equal to 0 }

e_g is now equal to e (the secret exponent)

Fig. 6. Algorithm for the Zero-Exponent, Multiple-Data (ZEMD) Attack

This algorithm gradually makes e_g equal to the secret exponent, bit by bit, by using the DPA
signal to decide if the guess of the ith bit being a one is correct.

[Figure: two DPA bias signals, labeled "Guess was incorrect (e_i = 0)" and "Guess was correct (e_i = 1)"]

Fig. 7. Zero-Exponent, Multiple-Data (ZEMD) Attack Results

The above plot compares the DPA bias signal produced when the guess of the ith exponent bit
is correct with the one produced when it is incorrect. The spikes in the correct signal
can be used to confirm the correct guess. This signal was obtained using 500 trial
exponentiations.



The signals in Fig. 7 were obtained by averaging 500 random power signals, but we 
have also been able to mount this attack with only 100 power signals per exponent bit. 

In the attack we implemented it was necessary to collect power signals using a win- 
dowing approach. This meant it was necessary to collect new power signals for each 
exponent bit being attacked. With optimizations to the equipment and algorithm, more 
exponent bits could be attacked simultaneously requiring even fewer trial exponentia- 
tions. The exact number of trial exponentiations necessary is dependent on the equip- 
ment of the adversary, the size of the power biases, and the noise in the signals. 
Implementors need to keep the ZEMD attack in mind when designing modular expo- 
nentiation hardware and software. 
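
The partitioning step of the ZEMD attack can be illustrated with a small simulation. The sketch below is only a toy model under stated assumptions: the card leaks the Hamming weight of the low byte of each intermediate result plus Gaussian noise, the partition uses weights of at least 5 versus at most 3 (rather than the weight-eight versus weight-zero partition used for Fig. 7), the same set of traces is reused for every exponent bit, and the modulus, secret exponent, threshold and function names are all hypothetical values chosen for the example.

```python
import random


def hw8(x):
    """Hamming weight of the low byte of x (the toy leakage model)."""
    return bin(x & 0xFF).count("1")


def card_trace(m, e, n_bits, modulus, rng, noise=1.0):
    """Toy smartcard: one noisy sample per square or multiply."""
    r, trace = 1, []
    for i in range(n_bits - 1, -1, -1):
        r = (r * r) % modulus
        trace.append(hw8(r) + rng.gauss(0.0, noise))
        if (e >> i) & 1:
            r = (r * m) % modulus
            trace.append(hw8(r) + rng.gauss(0.0, noise))
    return trace


def mean_trace(traces):
    """Sample-by-sample average of a list of equally long traces."""
    return [sum(column) / len(traces) for column in zip(*traces)]


def zemd_attack(messages, traces, n_bits, modulus, threshold=1.5):
    e_g = 0
    for i in range(n_bits - 1, -1, -1):
        known = e_g >> (i + 1)                 # bits recovered so far
        high, low = [], []
        for m, trace in zip(messages, traces):
            r = pow(m, known, modulus)         # simulated state before step i
            r = (r * r) % modulus              # simulated square
            r = (r * m) % modulus              # simulated multiply (guess: bit is 1)
            weight = hw8(r)
            if weight >= 5:
                high.append(trace)
            elif weight <= 3:
                low.append(trace)
        bias = [h - l for h, l in zip(mean_trace(high), mean_trace(low))]
        if max(abs(d) for d in bias) > threshold:   # spikes: the multiply occurred
            e_g |= 1 << i
    return e_g


def demo(seed=1):
    rng = random.Random(seed)
    modulus = 65537 * 65539                    # toy modulus, known to the attacker
    n_bits, secret = 16, 0xB6E5                # secret exponent, top bit set
    messages = [rng.randrange(2, modulus) for _ in range(400)]
    traces = [card_trace(m, secret, n_bits, modulus, rng) for m in messages]
    return zemd_attack(messages, traces, n_bits, modulus), secret


if __name__ == "__main__":
    recovered, secret = demo()
    print(hex(recovered), hex(secret))
```

The attacker never chooses or knows an exponent here; only the messages, the modulus and the exponentiation algorithm are used, which matches the assumptions listed for this attack.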



6 Countermeasures 

Potential countermeasures to the attacks described in this paper include many of the
same techniques described to prevent timing attacks on exponentiation. Kocher’s [3]
suggestion for adapting the techniques used for blinding signatures [24] can also be
applied to prevent power analysis attacks. Prior to exponentiation, the message could
be blinded with a random value, $v_i$, and unblinded after exponentiation with
$v_f = (v_i^{-1})^e \bmod N$. Kocher suggests an efficient way to calculate and maintain
$(v_i, v_f)$ pairs.

Message blinding would prevent the MESD and ZEMD attacks, but since the same
exponent is being used, the SEMD attack would still be effective. To prevent the SEMD
attack, exponent blinding, also described in [3], would be necessary. In an RSA
cryptosystem, the exponent can be blinded by adding a random multiple of $\phi(N)$, where
$\phi(N) = (p - 1)(q - 1)$ and $N = pq$. In summary, the exponentiation process would go
as follows:

1. Blind the message M: $\hat{M} = (v_i M) \bmod N$

2. Blind the exponent e: $\hat{e} = e + r\phi(N)$

3. Exponentiate: $\hat{S} = \hat{M}^{\hat{e}} \bmod N$

4. Unblind the result: $S = (v_f \hat{S}) \bmod N$
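
The four steps can be checked with a short Python sketch. It is a functional illustration only: the function name, the 16-bit range for the random multiplier r and the tiny primes in the usage note are assumptions, and the unblinding value is computed directly with an extra exponentiation, whereas a real implementation would maintain the $(v_i, v_f)$ pairs as suggested in [3] so as not to expose e again.

```python
import math
import random


def blinded_exponentiation(m, e, p, q, rng):
    """RSA exponentiation with message and exponent blinding (steps 1 to 4)."""
    n = p * q
    phi = (p - 1) * (q - 1)

    # Pick a blinding value v_i that is invertible modulo n.
    v_i = rng.randrange(2, n)
    while math.gcd(v_i, n) != 1:
        v_i = rng.randrange(2, n)
    # v_f = (v_i^-1)^e mod n; pow(x, -1, n) needs Python 3.8 or later.
    v_f = pow(pow(v_i, -1, n), e, n)

    m_hat = (v_i * m) % n                        # 1. blind the message
    e_hat = e + rng.randrange(1, 2**16) * phi    # 2. blind the exponent
    s_hat = pow(m_hat, e_hat, n)                 # 3. exponentiate
    return (v_f * s_hat) % n                     # 4. unblind the result
```

For example, with the toy parameters p = 61, q = 53 and e = 2753, `blinded_exponentiation(65, 2753, 61, 53, random.Random(0))` returns the same value as `pow(65, 2753, 61 * 53)`, although the message and the exponent actually processed differ from run to run.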

Another way to protect against power analysis attack is to randomize the
exponentiation algorithm. One way this can be accomplished is to combine the two
square-and-multiply algorithms of Fig. 1. A randomized exponentiation algorithm could begin by
selecting a random starting point in the exponent. Exponentiation would proceed from
this random starting point towards the most significant bit using exp2 of Fig. 1. Then,
the algorithm would return to the starting point and finish the exponentiation using exp1
and moving towards the least significant bit. It would be difficult for an attacker to
determine the random starting point from just one power trace (an SPA attack), so this
algorithm would effectively randomize the exponentiation. The amount of randomization
that is possible depends on the number of bits in the exponent. For large exponents
this randomization might be enough to make power analysis attacks impractical to all
but the most sophisticated adversaries. All the attacks presented in this paper would be
significantly diminished by randomizing the exponentiation.
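
One possible reading of this randomized algorithm is sketched below in Python. It assumes that exp2 is the right-to-left square-and-multiply (it consumes bits towards the most significant bit) and that exp1 is the left-to-right variant; the function name and parameters are hypothetical.

```python
import random


def randomized_exp(m, e, n_bits, modulus, rng):
    """Compute m^e mod modulus, starting at a random bit position of e."""
    start = rng.randrange(n_bits)          # random starting point in the exponent

    # Phase 1 (right-to-left, towards the most significant bit):
    # accumulates m^(e >> start).
    result, base = 1, m % modulus
    for i in range(start, n_bits):
        if (e >> i) & 1:
            result = (result * base) % modulus
        base = (base * base) % modulus

    # Phase 2 (left-to-right, towards the least significant bit):
    # consumes the remaining bits start-1 .. 0.
    for i in range(start - 1, -1, -1):
        result = (result * result) % modulus
        if (e >> i) & 1:
            result = (result * m) % modulus
    return result
```

Whatever starting point is drawn, the result equals m^e mod N, but the sequence of intermediate values, and therefore the power signal, changes from one run to the next.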



7 Conclusions 

The potential threat of monitoring power consumption signals to learn the private key
in a two-key, public-key cryptosystem has been investigated. A variety of vulnerabilities
have been documented and three new attacks were developed. The practicality of
all three attacks was confirmed by testing on actual smartcard hardware. Table 1 sum- 
marizes the attacks and some of the assumptions and possible solutions. 

The goal of this research is to point out the potential vulnerabilities and to provide 
guidance towards the design of more secure tamper-resistant devices. Hopefully the 
results of this paper will encourage the design and development of solutions to the prob- 
lems posed by power analysis attacks. 



TABLE 1: Summary of Power Analysis Attacks on Exponentiation

Attack   Number of trial     Assumptions                             Possible
Name     exponentiations                                             Solution

SEMD     20,000              attacker knows one exponent             exponent blinding
MESD     200                 attacker can choose exponent            message blinding
ZEMD     200                 attacker knows algorithm and modulus    message blinding

Abstract. Paul Kocher recently developed attacks based on the electric
consumption of chips that perform cryptographic computations.

Among those attacks, the “Differential Power Analysis” (DPA) is probably
one of the most impressive and most difficult to avoid.

In this paper, we present several ideas to resist this type of attack, and
in particular we develop one of them which leads, interestingly, to rather
precise mathematical analysis. Thus we show that it is possible to build
an implementation that is provably DPA-resistant, in a “local” and restricted
way (i.e. when - given a chip with a fixed key - the attacker
only tries to detect predictable local deviations in the differentials of
mean curves). We also briefly discuss some more general attacks, which
are sometimes efficient where the “original” DPA fails. Many consumption
measurements have been made on real chips to test the ideas presented
in this paper, and some of the obtained curves are printed here.

Note: An extended version of this paper can be obtained from the authors. 

1 Introduction 

This paper is about a way of securing a cryptographic algorithm that makes use
of a secret key. More precisely, the goal consists in building an implementation
of the algorithm that is not vulnerable to a certain type of physical attack, the
so-called “Differential Power Analysis”.

These DPA attacks belong to a general family of attacks that look for information
about the secret key by studying the electric consumption of the electronic
device during the execution of the computation. In this family, we usually
distinguish between SPA attacks (“Simple Power Analysis”) and DPA attacks.

In SPA attacks, the aim is essentially to guess - from the values of the
consumption - which particular instruction is being computed at a certain time 
and with which input or output, and then to use this information to deduce some 
part of the secret. Figure 1 shows the electric consumption of a chip, measured 
during a DES computation on a real smartcard. The fact that the 16 rounds of 
the DES algorithm are clearly visible is a good sign that power analysis attacks 
may indeed provide information about what the chip is doing. 

* Patents Pending 



C.K. Koc and C. Paar (Eds.): CHES’99, LNCS 1717, pp. 158-172, 1999. 
(c) Springer-Verlag Berlin Heidelberg 1999




DES and Differential Power Analysis 159 



Fig. 1. Electric consumption measured on the 16 rounds of a DES computation



In DPA attacks, some differentials on two sets of average consumption are
computed, and the attack succeeds if an unusual phenomenon appears - on
these differentials of consumption - for a good choice of some of the key bits
(we give details below), so that we are able to find out those key bits. What
makes DPA attacks so impressive, when they work, is the fact that they can
find out the secret key of a public algorithm (for example DES, but also many
other algorithms) without knowing anything (or trying to find anything) about
the particular implementation of that algorithm. Implementations exist that are
DPA-resistant (differentials do not show anything special) but not SPA-resistant
(some critical information can be deduced from the consumption curves).
Conversely, other implementations exist that are SPA-resistant but not DPA-resistant
(some critical information can be found by studying differentials of two
mean curves of consumption). Finally, some implementations can be found that
resist both types of attack (at least at present), or neither of them.

Throughout this paper, we focus on DPA and do not deal any further
with SPA. Indeed, as we see below, DPA can easily be analyzed in a
mathematical way (and not only in an empirical way). There exist many attacks
based on the electric consumption. We do not claim to give here solutions to all
the problems that may result from these attacks.

The cryptographic algorithms we consider here make use of a secret key in
order to compute an output from an input. This may be an encryption, a
decryption or a signature operation. In particular, all the material
described in this paper applies to “secret key algorithms” and also to the
so-called “public key algorithms”.



2 The “Differential Power Analysis” Attacks 

The “Differential Power Analysis” attacks, developed by Paul Kocher and
Cryptography Research (see [1]), start from the fact that the attacker can get much
more information (than the knowledge of the inputs and the outputs) during
the execution of the computation, such as the electric consumption
of the microcontroller or the electromagnetic radiation of the circuit. The
“Differential Power Analysis” (DPA for short) is an attack that makes it possible to obtain
information about the secret key (contained in a smartcard for example), by
performing a statistical analysis of the electric consumption records measured for
a large number of computations with the same key. Let us consider for instance
the case of the DES algorithm (Data Encryption Standard). It executes in 16
steps, called “rounds”. In each of these steps, a transformation F is performed
on 32 bits. This F function uses eight non-linear transformations from 6 bits to
4 bits, each of which is coded by a table called “S-box”. The DPA attack on
the DES can be performed as follows (the number 1000 used below is just an
example):

Step 1: We measure the consumption on the first round, for 1000 DES computations.
We denote by $E_1, \ldots, E_{1000}$ the input values of those 1000 computations.
We denote by $C_1, \ldots, C_{1000}$ the 1000 electric consumption curves measured
during the computations. We also compute the “mean curve” MC of those 1000
consumption curves.

Step 2: We focus for instance on the first output bit of the first S-box during the
first round. Let b be the value of that bit. It is easy to see that b depends on only
6 bits of the secret key. The attacker makes a hypothesis on the involved 6 bits.
He computes - from those 6 bits and from the $E_i$ - the expected (theoretical)
values for b. This makes it possible to separate the 1000 inputs $E_1, \ldots, E_{1000}$ into two
categories: those giving b = 0 and those giving b = 1.

Step 3: We now compute the mean MC' of the curves corresponding to inputs
of the first category (i.e. the one for which b = 0). If MC and MC' show an
appreciable difference (in a statistical sense, i.e. a difference much greater
than the standard deviation of the measured noise), we consider that the chosen
values for the 6 key bits were correct. If MC and MC' do not show any appreciable
difference, we repeat step 2 with another choice for the 6 bits.



Note: In practice, for each choice of the 6 key bits, we draw the curve representing
the difference between MC and MC'. As a result, we obtain 64 curves,
among which one is supposed to be very special, i.e. to show an appreciable
difference, compared to all the others.

Step 4: We repeat steps 2 and 3 with a “target” bit b in the second S-box, then
in the third S-box, and so on until the eighth S-box. As a result, we finally obtain 48
bits of the secret key.

Step 5: The remaining 8 bits can be found by exhaustive search.
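
Steps 1 to 3 can be made concrete with a small Python sketch on the first DES S-box. It relies on several simplifying assumptions that are not part of the attack as measured on real cards: each consumption "curve" is reduced to a single noise-free sample equal to the Hamming weight of the S-box output, the inputs are taken directly as the 6 bits entering the S-box (in a real attack they are derived from the $E_i$ through the DES expansion), all 64 possible inputs are used, and the key value and function names are hypothetical.

```python
# First DES S-box (S1): row = outer two input bits, column = inner four bits.
S1 = [
    [14, 4, 13, 1, 2, 15, 11, 8, 3, 10, 6, 12, 5, 9, 0, 7],
    [0, 15, 7, 4, 14, 2, 13, 1, 10, 6, 12, 11, 9, 5, 3, 8],
    [4, 1, 14, 8, 13, 6, 2, 11, 15, 12, 9, 7, 3, 10, 5, 0],
    [15, 12, 8, 2, 4, 9, 1, 7, 5, 11, 3, 14, 10, 0, 6, 13],
]


def sbox1(x):
    """Apply S1 to a 6-bit value x."""
    row = ((x >> 4) & 0b10) | (x & 1)
    col = (x >> 1) & 0xF
    return S1[row][col]


def target_bit(e, guess):
    """Predicted first output bit b of S1 for input e and 6-bit key guess."""
    return sbox1(e ^ guess) >> 3


def dpa_differences(inputs, curves):
    """|MC - MC'| for each of the 64 hypotheses on the 6 key bits."""
    mc = sum(curves) / len(curves)                     # step 1: mean curve MC
    diffs = []
    for guess in range(64):                            # step 2: one hypothesis
        category0 = [c for e, c in zip(inputs, curves)
                     if target_bit(e, guess) == 0]
        mc_prime = sum(category0) / len(category0)     # step 3: mean curve MC'
        diffs.append(abs(mc - mc_prime))
    return diffs


def leak(e, key):
    """Toy consumption sample: Hamming weight of the S-box output."""
    return bin(sbox1(e ^ key)).count("1")


KEY = 0b101101                                         # the 6 key bits to find
inputs = list(range(64))
curves = [leak(e, KEY) for e in inputs]
diffs = dpa_differences(inputs, curves)
print(diffs[KEY])
```

In this model the correct hypothesis separates the curves exactly along the leaking bit. Wrong hypotheses partition the curves in a way that is not aligned with the leak, although they need not give a difference of exactly zero, which is why all 64 curves are compared in practice.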

Note: It is also possible to focus (in steps 2, 3 and 4) on the set of the four
output bits for the considered S-boxes, instead of only one output bit. This is
what we actually did for real smartcards. In that case, the inputs are separated
into 16 categories: those giving 0000 as output, those giving 0001, ..., those
giving 1111. In step 3, we may compute for example the mean MC' of the curves
corresponding to the last category (i.e. the one which gives 1111 as output). As
a result, the mean MC' is computed on approximately 1/16 of the curves (instead
of approximately half of the curves with step 3 above): this may compel us to
use a number of DES computations greater than 1000, but it generally leads to
a more appreciable difference between MC and MC'.

We present in figures 2 and 3 two curves resulting from steps 2 and 3,
for a classical implementation of DES on a real smartcard (with “1111” as target
output of the first S-box and with 2048 different inputs, even if we noted that
512 inputs are sufficient). A detailed analysis of the 64 obtained curves (which we
cannot all print here, for lack of space) shows that the one corresponding
to the correct choice of the 6 key bits can easily be detected (it contains much
greater peaks than all the others).




Fig. 2. An example of difference of the curves MC and MC' when the 6 bits are false 






L. Goubin and J. Patarin 




Fig. 3. Difference of the curves MC and MC' when the 6 bits are correct 



This attack does not require any knowledge about the individual electric 
consumption of each instruction, nor about the position in time of each of these 
instructions. It applies exactly the same way as soon as the attacker knows the 
outputs of the algorithm and the corresponding consumption curves. It only 
relies on the following fundamental hypothesis: 

Fundamental hypothesis: There exists an intermediate variable, that appears
during the computation of the algorithm, such that knowing a few key bits
(in practice less than 32 bits) allows us to decide whether or not two inputs
(respectively two outputs) give the same value for this variable.

All the algorithms that use S-boxes, such as DES, are potentially vulnerable 
to the DPA attack, because the “natural” implementations generally remain 
within the hypothesis mentioned above. 



3 Securing the Algorithm 

Several countermeasures against DPA attacks can be conceived. For instance: 

1. Introducing random timing shifts, so that the computed means do not
correspond any longer to the consumption of the same instruction. The crucial
point consists here in performing those shifts so that they cannot be easily
eliminated by a statistical treatment of the consumption curves.

2. Replacing some of the critical instructions (in particular the basic assembler
instructions involving writes to the carry, reads of data from an array,
etc.) by assembler instructions whose “consumption signature” is difficult to
analyze.

3. For a given algorithm, giving an explicit way of computing it, so that DPA
is provably inefficient on the obtained implementation. For instance, for a
DES-like algorithm, we detail in section 4 how to compute the non-linear
transformations of the S-boxes in order to avoid some DPA attacks.

In the present paper, we essentially study the third idea because it leads to 
a quite precise mathematical analysis. We give in this section a general method 
to implement an algorithm with a secret key so as to avoid the DPA attacks 
described above. The basic principle consists in programming the algorithm so 
that the fundamental hypothesis above is not true any longer (i.e. an interme- 
diate variable never depends on the knowledge of an easily accessible subset of 
the secret key). 



The Main Idea 

In this paper, we mainly study how this can be done by using the following main
idea: replacing each intermediate variable $V$, occurring during the computation
and depending on the inputs (or the outputs), by k variables $V_1, \ldots, V_k$, such
that $V_1, V_2, \ldots, V_k$ allow us - if we want - to retrieve $V$. More precisely, to
guarantee the security of the algorithm in its new form, it is sufficient to choose
a function $f$ satisfying the identity $V = f(V_1, \ldots, V_k)$, together with the two
following conditions:

Condition 1: From the knowledge of a value $v$ and for any fixed value $i$,
$1 \le i \le k$, it is not feasible to deduce information about the set of the values $v_i$ such
that there exists a $(k-1)$-tuple $(v_1, \ldots, v_{i-1}, v_{i+1}, \ldots, v_k)$ satisfying the equation

$f(v_1, \ldots, v_k) = v$.

Condition 2: The function $f$ is such that the transformations to be performed
on $V_1, V_2, \ldots$, or $V_k$ during the computation (instead of the transformations
usually performed on $V$) can be implemented without calculating $V$.

First example for condition 1: If we choose $f(v_1, \ldots, v_k) = v_1 \oplus v_2 \oplus \cdots \oplus v_k$,
where $\oplus$ denotes the bit-by-bit “exclusive-or” function, condition 1 is obviously
satisfied, because - for any fixed index $i$ between 1 and $k$ - the considered set of
the values $v_i$ contains all the possible values and thus does not depend on $v$.

Second example for condition 1: If we consider some variable $V$ whose
values lie in the multiplicative group of $\mathbb{Z}/n\mathbb{Z}$, we can choose the function
$f(v_1, \ldots, v_k) = v_1 \cdot v_2 \cdots v_k \bmod n$, where the new variables $v_1, v_2, \ldots, v_k$ also
have values in the multiplicative group of $\mathbb{Z}/n\mathbb{Z}$. Condition 1 is also obviously
true because - for any fixed index $i$ between 1 and $k$ - the considered set of the
values $v_i$ contains all the possible values and thus does not depend on $v$.

We then “translate” the algorithm by replacing each intermediate variable
$V$ depending on the inputs (or the outputs) by the k variables $V_1, \ldots, V_k$. In the
following sections, we study how conditions 1 and 2 can be achieved in the case
of the DES or RSA algorithms.
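
The two examples of functions f can be illustrated as follows. This Python sketch is only meant to show the splitting and recombination; the helper names, the choice of k and the toy modulus are assumptions, and a real implementation would of course never recombine the shares.

```python
import math
import random
from functools import reduce


def xor_split(v, k, n_bits, rng):
    """Split v into k shares with v = v_1 ^ v_2 ^ ... ^ v_k."""
    shares = [rng.randrange(1 << n_bits) for _ in range(k - 1)]
    last = reduce(lambda a, b: a ^ b, shares, v)
    return shares + [last]


def mul_split(v, k, n, rng):
    """Split a unit v of Z/nZ into k units with v = v_1 * ... * v_k mod n."""
    shares = []
    while len(shares) < k - 1:
        s = rng.randrange(1, n)
        if math.gcd(s, n) == 1:            # keep only invertible values
            shares.append(s)
    product = reduce(lambda a, b: (a * b) % n, shares, 1)
    # pow(x, -1, n) (modular inverse) needs Python 3.8 or later.
    return shares + [(v * pow(product, -1, n)) % n]


rng = random.Random(0)
xor_shares = xor_split(0x2B, 3, 6, rng)
mul_shares = mul_split(65, 3, 61 * 53, rng)
print(xor_shares, mul_shares)
```

In both cases the first k - 1 shares are drawn independently of v, so any single share, taken alone, can take every possible value whatever v is, which is the content of condition 1.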

4 The DES Algorithm: First Example of Implementation 
for DPA Resistance 

In this section, we consider the particular case of the DES algorithm. We choose
here to separate each intermediate variable $V$, occurring during the computation
and depending on the inputs (or the outputs), into two variables $V_1$ and $V_2$ (i.e.
we take k = 2). Let us choose the function $f(v_1, v_2) = v = v_1 \oplus v_2$ (see the
first example of section 3), which satisfies condition 1. From the construction of
the algorithm, it is easy to see that the transformations performed on $v$ always
belong to one of the five following categories:

1. permutation of the bits of v; 

2. expansion of the bits of v; 

3. “exclusive-or” between v and another variable v' of the same type; 

4. “exclusive-or” between v and a variable depending only on the key; 

5. transformation of v using a S-box. 

The first two cases correspond to linear transformations on the bits of the
variable $v$. For these ones, condition 2 is thus very easy to satisfy: we just have
- instead of the transformation usually performed on $v$ - to perform the permutation
or the expansion on $v_1$, then on $v_2$, and the identity $f(v_1, v_2) = v$, which
was true before the transformation, is also true afterwards.

In the same way, in the third case, we just have to replace the computation of
$v'' = v \oplus v'$ by the computation of $v''_1 = v_1 \oplus v'_1$ and $v''_2 = v_2 \oplus v'_2$. The identities
$f(v_1, v_2) = v$ and $f(v'_1, v'_2) = v'$ indeed give $f(v''_1, v''_2) = v''$, so that condition 2
is true again.

As concerns the exclusive-or between $v$ and a variable $c$ depending only on
the key, condition 2 is also very easy to satisfy: we just have to replace the
computation of $v \oplus c$ by $v_1 \oplus c$ (or $v_2 \oplus c$) and that gives condition 2.
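
The first four categories can be checked mechanically. In the Python sketch below, the 8-bit values, the expansion table and the helper names are hypothetical and serve only to exercise the identities; they are not the DES permutation or expansion tables.

```python
import random

rng = random.Random(0)
N_BITS = 8


def permute(v, table):
    """Permutation or expansion: output bit i is input bit table[i] (bit 0 = LSB)."""
    out = 0
    for i, j in enumerate(table):
        out |= ((v >> j) & 1) << i
    return out


def split(v, n_bits):
    """Return (v1, v2) with v = v1 ^ v2 and v1 uniformly random."""
    v1 = rng.randrange(1 << n_bits)
    return v1, v ^ v1


# Hypothetical 8-bit to 12-bit expansion (some input bits are repeated).
EXPANSION = [7, 0, 1, 2, 3, 4, 3, 4, 5, 6, 7, 0]

v, w, c = 0xA7, 0x3C, 0x5B      # two data-dependent values and a key-dependent one
v1, v2 = split(v, N_BITS)
w1, w2 = split(w, N_BITS)

# Cases 1 and 2: apply the permutation or expansion to each share separately.
p1, p2 = permute(v1, EXPANSION), permute(v2, EXPANSION)
# Case 3: exclusive-or of two shared variables, computed share by share.
x1, x2 = v1 ^ w1, v2 ^ w2
# Case 4: exclusive-or with a key-dependent value, applied to one share only.
k1, k2 = v1 ^ c, v2

print(p1 ^ p2 == permute(v, EXPANSION), x1 ^ x2 == v ^ w, k1 ^ k2 == v ^ c)
```

None of these computations ever forms v, w or their transformed values; the recombinations in the final line are there only to verify the identities.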

Finally, instead of the non-linear transformation $v' = S(v)$, given under
the form of a S-box (which in that example has 6-bit inputs and 4-bit outputs),
we implement the transformation $(v'_1, v'_2) = S'(v_1, v_2)$ by using two new
S-boxes (each of which sends 12 bits onto 4 bits). In order to keep the identity
$f(v'_1, v'_2) = v'$, we may choose:

$(v'_1, v'_2) = S'(v_1, v_2) = (A(v_1, v_2),\ S(v_1 \oplus v_2) \oplus A(v_1, v_2))$,

where $A$ denotes a randomly chosen secret transformation from 12 bits to 4
bits (see figure 4). The first of the new S-boxes corresponds to the table of the
transformation $(v_1, v_2) \mapsto A(v_1, v_2)$, and the second one corresponds to the table
of the transformation $(v_1, v_2) \mapsto S(v_1 \oplus v_2) \oplus A(v_1, v_2)$. Thanks to the randomly
chosen function $A$, condition 1 is satisfied. Moreover, the use of tables allows us
to avoid the computation of $v_1 \oplus v_2$, so that condition 2 is also true.

[Figure: modified implementation computing $v'_1 = A(v_1, v_2)$ and
$v'_2 = S(v_1 \oplus v_2) \oplus A(v_1, v_2)$; the values $v = v_1 \oplus v_2$ and
$v' = v'_1 \oplus v'_2$ never explicitly appear in RAM]

Fig. 4. Standard transformation of a S-box

The solution presented in this section is quite realistic for chips that compute
DES in hardware (and are not embedded in a card), or for PCs, because - in
those cases - enough memory is available. More precisely, the size of the memory
required to store the S-boxes is 32 Kbytes for the method described in this
section. This is too much for smartcards, for which specific variations using less
memory are described in section 5 below.
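
The construction of the two 12-bit tables, and the 32 Kbytes figure, can be reproduced with the following Python sketch. The S-box used here is a randomly generated stand-in (the identity being checked holds for any 6-bit to 4-bit table), and the function names and the packing of $(v_1, v_2)$ into a 12-bit index are assumptions.

```python
import random


def build_tables(sbox, rng):
    """Precompute A(v1, v2) and S(v1 ^ v2) ^ A(v1, v2) for all 12-bit inputs."""
    table_a, table_b = [], []
    for v1 in range(64):
        for v2 in range(64):
            a = rng.randrange(16)               # A(v1, v2): random 4-bit value
            table_a.append(a)
            # v1 ^ v2 is formed only here, when the tables are generated,
            # never while the card is running.
            table_b.append(sbox[v1 ^ v2] ^ a)
    return table_a, table_b


def masked_lookup(tables, v1, v2):
    """Return (v1', v2') using table lookups only."""
    index = (v1 << 6) | v2
    return tables[0][index], tables[1][index]


rng = random.Random(0)
stand_in_sbox = [rng.randrange(16) for _ in range(64)]   # not a real DES S-box
tables = build_tables(stand_in_sbox, rng)

# Memory: 2 tables per S-box, 4096 entries of 4 bits each, 8 S-boxes.
total_bytes = 8 * 2 * 4096 * 4 // 8
print(total_bytes)
```

Each table holds 4096 entries of 4 bits, i.e. 2 Kbytes, so sixteen tables give the 32 Kbytes mentioned above.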



5 Smartcard Implementations of DES 

First Variation 

In order to reduce the ROM used by the algorithm, it is quite possible to use 
the same random function A for the eight S-boxes (of the initial description of 
the DES), so that we have only nine (new) S-boxes (i.e. 18 Kbytes) to store in 
ROM, instead of sixteen S-boxes. 



Second Variation 

In order to reduce the size of the ROM needed to store the S-boxes, we can also
use the following method: instead of each non-linear transformation $v' = S(v)$ of
the initial implementation, given under the form of a S-box (with 6-bit inputs
and 4-bit outputs in the case of the DES), we implement the transformation
$(v'_1, v'_2) = S'(v_1, v_2)$ by using two S-boxes, each of which sends 6 bits onto 4
bits. The initial implementation of the computation $v' = S(v)$ is replaced by the
two following successive computations:

- $v_0 = \varphi(v_1 \oplus v_2)$

- $(v'_1, v'_2) = S'(v_1, v_2) = (A(v_0),\ S(\varphi^{-1}(v_0)) \oplus A(v_0))$

where $\varphi$ is a bijective and secret function from 6 bits to 6 bits and where $A$
denotes a random and secret transformation from 6 bits to 4 bits. The first
of the two new S-boxes corresponds to the table of the transformation
$v_0 \mapsto A(v_0)$ and the second one corresponds to the table of the transformation
$v_0 \mapsto S(\varphi^{-1}(v_0)) \oplus A(v_0)$. From this construction, the identity
$f(v'_1, v'_2) = v'$ is always true. Thanks to the random function $A$, condition 1 is
satisfied. Moreover, the use of tables allows us to avoid the computation of
$\varphi^{-1}(v_0) = v_1 \oplus v_2$, so that condition 2 is also true. This solution
(shown in figure 5) requires 512 bytes to store the S-boxes.

In order to satisfy condition 2, it remains to choose the bijective transformation
$\varphi$ such that the computation of $v_0 = \varphi(v_1 \oplus v_2)$ is feasible without computing
$v_1 \oplus v_2$. We give below two examples of possible choices for the function $\varphi$.

[Figure: initial implementation, in which the predictable values $v$ and $v' = S(v)$
appear in RAM at some time; modified implementation on $v_1$ and $v_2$, in which the
values $v = v_1 \oplus v_2$ and $v' = v'_1 \oplus v'_2$ never explicitly appear in RAM]

Fig. 5. Transformation of a S-box (second variation)

Example 1: a linear bijection

We choose $\varphi$ as a linear, secret and bijective function from 6 bits to 6 bits (we
consider the set of the 6-bit values as a vector space of dimension 6 over the
finite field $F_2$ with 2 elements). In practice, choosing $\varphi$ is equivalent to choosing
a random and invertible 6 x 6 matrix whose coefficients are 0 or 1. With this
choice of $\varphi$, it is easy to see that condition 2 is satisfied. Indeed, to compute
$\varphi(v_1 \oplus v_2)$, we just have to compute $\varphi(v_1)$, then $\varphi(v_2)$, and finally to compute
the “exclusive-or” of the two obtained results.

For instance, the matrix

1 1 0 1 0 0
1 1 0 1 0 1
0 1 1 0 1 0
1 1 1 0 1 0
0 1 1 1 1 0
0 0 1 1 0 1

is invertible. It corresponds to the linear bijection $\varphi$ from 6 bits to 6 bits defined by

$\varphi(u_1, u_2, u_3, u_4, u_5, u_6) = (u_1 \oplus u_2 \oplus u_4,\ u_1 \oplus u_2 \oplus u_4 \oplus u_6,\ u_2 \oplus u_3 \oplus u_5,\ u_1 \oplus u_2 \oplus u_3 \oplus u_5,\ u_2 \oplus u_3 \oplus u_4 \oplus u_5,\ u_3 \oplus u_4 \oplus u_6)$.

Let $v_1 = (v_{1,1}, v_{1,2}, v_{1,3}, v_{1,4}, v_{1,5}, v_{1,6})$ and $v_2 = (v_{2,1}, v_{2,2}, v_{2,3}, v_{2,4}, v_{2,5}, v_{2,6})$.
To compute $\varphi(v_1 \oplus v_2)$, we successively compute:

- $\varphi(v_1) = (v_{1,1} \oplus v_{1,2} \oplus v_{1,4},\ v_{1,1} \oplus v_{1,2} \oplus v_{1,4} \oplus v_{1,6},\ v_{1,2} \oplus v_{1,3} \oplus v_{1,5},\ v_{1,1} \oplus v_{1,2} \oplus v_{1,3} \oplus v_{1,5},\ v_{1,2} \oplus v_{1,3} \oplus v_{1,4} \oplus v_{1,5},\ v_{1,3} \oplus v_{1,4} \oplus v_{1,6})$

- $\varphi(v_2) = (v_{2,1} \oplus v_{2,2} \oplus v_{2,4},\ v_{2,1} \oplus v_{2,2} \oplus v_{2,4} \oplus v_{2,6},\ v_{2,2} \oplus v_{2,3} \oplus v_{2,5},\ v_{2,1} \oplus v_{2,2} \oplus v_{2,3} \oplus v_{2,5},\ v_{2,2} \oplus v_{2,3} \oplus v_{2,4} \oplus v_{2,5},\ v_{2,3} \oplus v_{2,4} \oplus v_{2,6})$

Then we compute the “exclusive-or” of the two obtained results.
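
The second variation with this linear bijection can be sketched in Python as follows. The rows of the matrix are those of Example 1, read with $u_1$ as the most significant bit; the S-box is a randomly generated stand-in (the identity being checked holds for any 6-bit to 4-bit table), and the function names are hypothetical.

```python
import random

# Rows of the matrix of Example 1; the leftmost bit of each row multiplies u_1.
ROWS = [0b110100, 0b110101, 0b011010, 0b111010, 0b011110, 0b001101]


def phi(u):
    """Linear bijection of Example 1 applied to a 6-bit value u."""
    out = 0
    for row in ROWS:
        out = (out << 1) | (bin(row & u).count("1") & 1)   # parity of selected bits
    return out


def build_tables(sbox, rng):
    """Tables v0 -> A(v0) and v0 -> S(phi^-1(v0)) ^ A(v0), 64 entries each."""
    phi_inv = {phi(u): u for u in range(64)}
    table_a = [rng.randrange(16) for _ in range(64)]
    table_b = [sbox[phi_inv[v0]] ^ table_a[v0] for v0 in range(64)]
    return table_a, table_b


def masked_sbox(tables, v1, v2):
    """Return (v1', v2') without ever forming v1 ^ v2."""
    v0 = phi(v1) ^ phi(v2)          # equals phi(v1 ^ v2) because phi is linear
    return tables[0][v0], tables[1][v0]


rng = random.Random(0)
stand_in_sbox = [rng.randrange(16) for _ in range(64)]   # not a real DES S-box
tables = build_tables(stand_in_sbox, rng)

# Memory: 2 tables per S-box, 64 entries of 4 bits each, 8 S-boxes.
total_bytes = 8 * 2 * 64 * 4 // 8
print(total_bytes)
```

Two tables of 64 four-bit entries take 64 bytes per S-box, which gives the 512 bytes stated for this variation.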

Example 2: a quadratic bijection

We choose $\varphi$ as a quadratic, secret and bijective function from 6 bits to 6 bits.
Here, “quadratic” means that each bit of the output is given by a polynomial
function of total degree two of the 6 bits of the input (which are identified with 6
elements of the finite field $F_2$). In practice, we may choose the function $\varphi$ defined
by $\varphi(x) = t(s(x)^5)$, where $s$ is a secret linear bijection from $(F_2)^6$ to $L$, $t$ is a
secret linear bijection from $L$ to $(F_2)^6$ and $L$ denotes an algebraic extension of
degree 6 over the finite field $F_2$. The bijectivity of this function $\varphi$ follows from
the fact that $a \mapsto a^5$ is a bijection on the extension $L$ (whose inverse is $b \mapsto b^{38}$,
since $5 \cdot 38 = 190 \equiv 1 \pmod{63}$).
To be convinced that condition 2 is still satisfied, just notice that we can write:

$\varphi(v_1 \oplus v_2) = \psi(v_1, v_1) \oplus \psi(v_1, v_2) \oplus \psi(v_2, v_1) \oplus \psi(v_2, v_2)$,

where the function $\psi$ is defined by $\psi(x, y) = t(s(x)^4 \cdot s(y))$.
For instance, if we identify £ io¥ 2 [X]/ ^ X ^ 



whose matrices are 



H 1 0 1 00 
110101 
011010 
111010 
011110 
001101 



010011 

110100 

101011 

011100 

101010 

001011 



s(y))- 

1) and if we choose s and t 



with respect to the basis 
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{I, X , X‘^ , , X^) of C over F 2 and to the canonical basis of (F 2 )^ over F 2 , 

we obtain the following quadratic bijection (/? from 6 bits to 6 bits: 

\ , U2 ? "^3 ? "^4 ? "^5 ? ~ {^ 2^5 ® 'U]_'U4 0 'U4 0 Uq 0 UqU2 0 U4UQ 0 U2 0 ® ® 

W 4 W 3 , W 2 ^5 0 ^5^1 0 0 ^4 0 ^6 0 W 4 W 5 0 ^2 0 ^3 0 ^3^1 , ^ 2 ^5 0 ^5^1 0 ^6^5 0 

W4 0 W3W5 0 0 U4UQ 0 UqUs 0 W4W3 0 UsUi , W4 0 U2US 0 UqUi 0 U4UQ 0 W5 0 

UqUs 0 U^Us.U^Ui 0 U1U4 0 ^6 0 U3U5 0 ^4^5 0 0 ^ 6^1 0 ^4^6 0 ^3 0 ^6^3 0 

W 4 ^ 2,^4 0 ^6 0 U3U5 0 Wl 0 U4UQ 0 ^6^3)- 

To compute $\varphi(v_1 \oplus v_2)$, we use the function $\psi(x, y) = t(s(x)^4 \cdot s(y))$ from 12
bits to 6 bits, which gives the 6 output bits from the 12 input bits as follows:

t/;(xi , X 2 , X 3 , X 4 , X 5 , ^6 , yi , ^2 , ^3 , ^4 , ys , ye ) = (^3^5 0 ^ 6^2 0 ^6^3 0 ^6^4 0 ^3^1 0 
^eVi 0 ^iZ/3 0 ^iZ/5 0 ^5Z/2 0 ^5Z/5 0 ^sZ/i 0 ^eVe 0 ^iVe 0 ^iZ/2 0 ^iZ/4 0 ^ 2^/1 0 
^2V2 0 ^4Z/4 0 ^SVS 0 ^SVe 0 ^4Z/3 0 ^5Z/3 , ^4Z/5 0 ^3^1 0 ^GVI 0 ^2Z/5 0 ^SZ/l 0 ^6^/6 0 
^lZ/6 0 ^lZ/2 0 ^2Vl 0 ^2V2 0 ^ 4 Vl 0 ^4Z/4 0 ^SVS , ^6Z/2 0 ^gVS 0 ^GVa 0 ^6Z/5 0 ^SVl 0 
Xeyi 0 ^2Z/5 0 ^SZ/l 0 ^lZ/6 0 ^lZ/1 0 ^lZ/2 0 ^lZ/4 0 ^2yi 0 ^2Z/4 0 ^4Z/2 0 ^2Z/6 0 
^3Z/4 0 ^5Z/3 , ^SVl 0 ^eZ/2 0 ^2Z/6 0 ^5Z/3 0 ^5Z/4 0 ^5Z/6 0 ^6Z/3 0 ^2Z/3 0 ^4Z/6 0 ^6Z/5 0 
^lZ/3 0 ^5Z/5 0 ^2Z/4 0 ^4Z/2 0 ^4Z/5 0 ^3Z/5 0 ^4Z/3 0 ^eZ/l 0 ^4^1 , ^3^1 0 ^6^/6 0 ^5Z/3 0 
^5Z/6 0 ^5Z/2 0 ^lZ/5 0 ^lZ/1 0 ^lZ/2 0 ^2^/1 0 ^ 2 V 3 0 ^3Z/6 0 ^6Z/5 0 ^iVs 0 ^2Z/4 0 
^sVs 0 ^ 4 Z /5 0 ^2Z/5 0 ^eZ/i 0 ^ 4 yi 0 ^eZ /4 0 ^3Z/2 , ^eZ/e 0 ^ 4 Z /4 0 ^5^4 0 x^ye 0 xeys 0 
^lZ/6 0 ^lZ/1 0 ^lZ/2 0 ^2Z/1 0 ^eZ/5 0 ^2Z/4 0 ^4Z/2 0 ^4Z/5 0 ^3Z/5 0 ^6Z/l 0 ^ 6 ^/ 4 ). 

By using these formulas, we successively compute $\psi(v_1, v_1)$, $\psi(v_1, v_2)$, $\psi(v_2, v_1)$
and $\psi(v_2, v_2)$. Finally, we compute the “exclusive-or” of the four obtained results.
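
The algebraic facts behind this second example can be checked with a short sketch. The Python code below is only an illustration under simplifying assumptions: it takes both s and t to be the identity map, represents elements of the degree-6 extension as 6-bit integers, and reduces modulo $X^6 + X + 1$ (any irreducible polynomial of degree 6 would serve equally well). The helper names `gf_mul`, `gf_pow`, `phi` and `psi` are ours.

```python
MODULUS = 0b1000011  # X^6 + X + 1, assumed reduction polynomial

def gf_mul(a, b):
    """Multiply two elements of GF(2^6) given as 6-bit integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0b1000000:  # degree reached 6: reduce
            a ^= MODULUS
    return result

def gf_pow(a, exponent):
    """Raise a to a non-negative integer power by repeated multiplication."""
    result = 1
    for _ in range(exponent):
        result = gf_mul(result, a)
    return result

def phi(x):
    """Quadratic map a -> a^5 (with s and t taken as the identity)."""
    return gf_pow(x, 5)

def psi(x, y):
    """Auxiliary map (x, y) -> x^4 * y (with s and t taken as the identity)."""
    return gf_mul(gf_pow(x, 4), y)

ELEMENTS = range(64)

# a -> a^5 is a bijection, and b -> b^38 undoes it.
fifth_power_is_bijective = len({phi(a) for a in ELEMENTS}) == 64
inverse_is_38th_power = all(gf_pow(phi(a), 38) == a for a in ELEMENTS)

# phi(v1 xor v2) splits into four psi terms, so v1 xor v2 is never formed.
four_term_identity = all(
    phi(a ^ b) == psi(a, a) ^ psi(a, b) ^ psi(b, a) ^ psi(b, b)
    for a in ELEMENTS
    for b in ELEMENTS
)
```

The four-term identity follows from $(a + b)^4 = a^4 + b^4$ in characteristic 2; composing with secret linear maps s and t, as the text does, preserves it.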

Third Variation 

To further reduce the size of the ROM needed to store the S-boxes, we can apply
the ideas of variations 1 and 2 simultaneously: we use the second variation,
with the same secret bijection $\varphi$ (from 6 bits to 6 bits) and the same secret
random function A (from 6 bits to 6 bits) in the new implementation of each
non-linear transformation given by an S-box. This variation thus requires only
288 bytes to store the S-boxes. We have applied the Differential Power Analy- 
sis on real smartcard implementations of this third variation. Two examples of 
differential mean curves (with 2048 inputs and with THE as target output of 
the first S-box) are presented in figures 6 and 7. A precise analysis of the 64 
curves given by the DPA (see note after step 3, in section 2) shows that none of 
them appears to be “very special”, compared to the others, so that we can say 
that this implementation resists the DPA attack (at least in its basic form, see 
appendix 2 for a possible generalization that could still be dangerous). 



Fourth Variation 

In this last variation, instead of implementing the transformation $(v'_1, v'_2) = S'(v_1, v_2)$
(which replaces the non-linear transformation $v' = S(v)$ of the initial
implementation, given by an S-box) by using two S-boxes, we perform the
computation of $v'_1$ (respectively $v'_2$) by using a simple algebraic function (i.e. the bits
of $v'_1$ (respectively $v'_2$) are given by a polynomial function of total degree 1 or 2
of the bits of $v_1$ and $v_2$), then we compute $v'_2$ (respectively $v'_1$) by using a table.
This again reduces the ROM needed for the implementation. This last
variation requires only 256 bytes to store the S-boxes.

6 The RSA Algorithm 

The “Power Analysis” attacks also threaten the classical implementations of the
RSA algorithm. Indeed, these implementations often use the so-called
“square-and-multiply” principle to perform the computation of $x^d \bmod n$. It consists in
writing the binary decomposition $d = d_{m-1} 2^{m-1} + d_{m-2} 2^{m-2} + \ldots + d_1 2^1 + d_0 2^0$
of the secret exponent d, and then in performing the computation as follows:

1. $z \leftarrow 1$;

For i going backwards from m − 1 to 0 do:

2. $z \leftarrow z^2 \bmod n$;

3. if $d_i = 1$ then $z \leftarrow z \times x \bmod n$.

In this computation, we see that, among the successive values taken by
the z variable, the first ones depend on only a few bits of the secret key d.
The fundamental hypothesis that enables the DPA attack is thus satisfied. As
a result, we can guess for instance the 10 most significant bits of d by studying
the consumption measurements on the part of the algorithm corresponding to i going
from m − 1 to m − 10. We can then continue the attack by using consumption
measurements on the part of the algorithm corresponding to i going from m − 11 to
m − 20, which gives the next 10 bits of d, and so on. We finally find all the bits
of the secret exponent d.
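
The following Python sketch, given purely as an illustration, implements the three steps of square-and-multiply and records the value of z after each exponent bit. The function name and the returned list of intermediates are our own additions. It makes the attack hypothesis concrete: the value of z after the j most significant bits have been processed depends only on those j bits.

```python
def square_and_multiply(x, d, n):
    """Return (x**d mod n, values of z after each processed exponent bit).

    Exponent bits are scanned from the most significant (i = m - 1)
    down to the least significant (i = 0).
    """
    m = d.bit_length()
    z = 1                              # step 1
    intermediates = []
    for i in range(m - 1, -1, -1):
        z = (z * z) % n                # step 2: always square
        if (d >> i) & 1:
            z = (z * x) % n            # step 3: multiply only if d_i = 1
        intermediates.append(z)
    return z, intermediates
```

After the bits of index m − 1 down to i have been processed, z equals x raised to the integer formed by those top bits, modulo n. An attacker who guesses the top bits can therefore predict the early intermediate values.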

The method described in section 3 also applies to securing the RSA algorithm.
We use here a separation of each intermediate variable V (whose values lie in the
multiplicative group of $\mathbb{Z}/n\mathbb{Z}$), occurring during the computation and depending
on the inputs (or the outputs), into two variables $V_1$ and $V_2$ (i.e. we take k = 2),
and we choose the function $f(v_1, v_2) = v = v_1 \cdot v_2 \bmod n$. We already saw in
section 3 (cf. “second example”) that this function f satisfies condition 1.

We thus replace x by $(x_1, x_2)$ such that $x = x_1 \cdot x_2 \bmod n$ and z by $(z_1, z_2)$
such that $z = z_1 \cdot z_2 \bmod n$ (in practice, we can for instance choose $x_1$ randomly
and deduce $x_2$). Considering again the three steps of the “square-and-multiply”
method, we perform the following transformations:

1. $z \leftarrow 1$ is replaced by $z_1 \leftarrow 1$ and $z_2 \leftarrow 1$;

2. $z \leftarrow z^2 \bmod n$ is replaced by $z_1 \leftarrow z_1^2 \bmod n$ and $z_2 \leftarrow z_2^2 \bmod n$;

3. $z \leftarrow z \times x \bmod n$ is replaced by $z_1 \leftarrow z_1 \times x_1 \bmod n$ and $z_2 \leftarrow z_2 \times x_2 \bmod n$.

It is easy to check that the identity $z = f(z_1, z_2)$ remains true all along the
computation, which shows that condition 2 is satisfied.
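
The transformed steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the use of Python's `random` module as the source of $x_1$, and the choice of deducing $x_2$ through a modular inverse (which requires Python 3.8 or later for `pow(x1, -1, n)`) are all assumptions made for the example.

```python
import math
import random

def split_square_and_multiply(x, d, n, rng=random):
    """Compute shares (z1, z2) with z1 * z2 = x**d (mod n).

    x is split multiplicatively as x = x1 * x2 mod n, and every step of
    square-and-multiply is applied to each share independently.
    """
    # Choose x1 at random among the invertible residues, then deduce x2.
    while True:
        x1 = rng.randrange(1, n)
        if math.gcd(x1, n) == 1:
            break
    x2 = (x * pow(x1, -1, n)) % n

    z1, z2 = 1, 1                                  # step 1 on both shares
    for i in range(d.bit_length() - 1, -1, -1):
        z1 = (z1 * z1) % n                         # step 2 on each share
        z2 = (z2 * z2) % n
        if (d >> i) & 1:
            z1 = (z1 * x1) % n                     # step 3 on each share
            z2 = (z2 * x2) % n
    return z1, z2
```

The final result is recovered as `(z1 * z2) % n`. The two loops over z1 and z2 never exchange data, which is why they can be run sequentially or concurrently.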

Let us notice that the computations performed on the $z_1$ variable
and on the $z_2$ variable are completely independent. We can thus imagine
performing the two computations either sequentially, or in an overlapped
way, or simultaneously in the case of multiprogramming, or simultaneously on
different processors working concurrently.

7 Generalized Attacks 

Recently, more general attacks were introduced, where the attacker tries to
correlate different points of a power consumption curve. We do not have space here to
analyze in detail the effect of this idea on the “Duplication Method”. However,
it is possible to show that if each variable is split into, say, k variables, then
the complexity of the implementation increases in O(k), while the complexity of
the attack increases exponentially in k.
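
To make the O(k) cost on the implementation side concrete, here is a minimal Python sketch of splitting a value into k shares. It assumes the exclusive-or recombination function $v = v_1 \oplus \ldots \oplus v_k$ and 6-bit values; the function names and the use of the `secrets` module as the randomness source are hypothetical choices for the example, and the sketch says nothing about the attack-side complexity.

```python
import secrets
from functools import reduce

def split_into_shares(v, k, bits=6):
    """Split v into k shares whose exclusive-or equals v.

    The first k - 1 shares are uniformly random; the last one is chosen
    so that the XOR of all k shares gives back v. Cost is linear in k.
    """
    shares = [secrets.randbits(bits) for _ in range(k - 1)]
    last = reduce(lambda a, b: a ^ b, shares, v)
    return shares + [last]

def recombine(shares):
    """XOR all shares together to recover the original value."""
    return reduce(lambda a, b: a ^ b, shares, 0)
```

Any k − 1 of the shares are jointly uniform and independent of v, so an attacker must combine information about all k shares, which is the intuition behind the growth of the attack complexity with k.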

As concerns DES implementations, we also recommend, when it is possible, 
to use different S-Boxes for each smartcard (stored in EEPROM). In particular, 
this avoids some attacks which use a smartcard with a known key to help finding 
the key in another smartcard whose key is unknown. 

8 Conclusion 

In this paper, we investigate how the study of the electric consumption measures 
of an electronic device can be used by an attacker to get information about the 
secret key of the cryptographic algorithm computed by the chip. More precisely, 
we focus on the so-called Differential Power Attacks, which were recently introdu- 
ced by Paul Kocher, and which use a statistical analysis of a set of consumption 
curves measured for many different inputs of the cryptographic algorithm. 

We study more precisely how DPA attacks work, and what precise hypo- 
theses they rely on. We then present several ways of securing cryptosystems. 
In particular, concrete examples of such countermeasures are described in the 
cases of DES and RSA, which are the most used cryptographic algorithms at 
the present. 

To secure those algorithms, we essentially study the main idea that consists
in splitting each intermediate variable occurring in the computation into two
(or more) variables, such that the values of these new variables cannot be easily
predicted. The resulting implementations can be proved to resist the “local”
version of Differential Power Analysis (where the attacker only tries to detect
local deviations in the differentials of mean curves). Nevertheless, other attacks
can be conceived that still use the analysis of electric consumption, and we do not
claim to solve all security problems linked to these threats. These attacks
are not merely theoretical, since we found real products that are defeated by them;
this also shows that theoretical investigations have to be continued on this
sensitive subject.

References 

1. Paul Kocher, Joshua Jaffe, Benjamin Jun, Introduction to Differential Power
Analysis and Related Attacks, 1998. This paper is available at
http://www.cryptography.com/dpa/technical/index.html




IPA: A New Class of Power Attacks 



Paul N. Fahn* 
and Peter K. Pearson** 

Certicom Corp. 

25801 Industrial Blvd. 
Hayward, CA 94545, USA 



Abstract. We present Inferential Power Analysis (IPA), a new class of 
attacks based on power analysis. An IPA attack has two stages: a pro- 
filing stage and a key extraction stage. In the profiling stage, intratrace 
differencing, averaging, and other statistical operations are performed on 
a large number of power traces to learn details of the implementation, 
leading to the location and identification of key bits. In the key extrac- 
tion stage, the key is obtained from a very few power traces; we have 
successfully extracted keys from a single trace. Compared to differential 
power analysis, IPA has the advantages that the attacker does not need 
either plaintext or ciphertext, and that, in the key extraction stage, a 
key can be obtained from a small number of traces. 



1 Introduction 

Recent years have seen significant progress in what are called “power attacks” 
on cryptographic modules, attacks in which one monitors the power drawn by 
the module and from these measurements extracts some secret quantity that 
the module manipulates during some cryptographic operation. In 1998, Kocher 
et al. [5] described Differential Power Analysis (DPA), in which power measu- 
rements from many repeated cryptographic operations are cleverly combined. 
More recently, Biham and Shamir [1] showed how to derive key information by
combining power measurements on different cryptographic modules. 

This paper describes a class of attacks called Inferential Power Analysis (IPA) 
attacks. An IPA attack is characterized by two stages, the first a lengthy pro- 
filing stage, and the second a simpler key extraction stage. The profiling step 
is typically based on comparisons of repeated parts of a selected cryptographic 
operation, such as the different rounds in a DES encryption. These comparisons 
can be performed on a single cryptographic module, requiring many measured 
operations, and result in a profile that can subsequently be used to extract keys 
from other modules using as little as a single cryptographic operation. Unlike 
DPA, these attacks do not require knowledge of the operation’s inputs or out- 
puts. 
More generally, these attacks illustrate a class of attacks in which a one-time
effort requiring just one module produces information with which keys can be 
easily extracted from other modules of the same design. Such attacks can be 
applied not only by a cardholder against a smart card in his possession, but also 
by a terminal owner against smartcards that use his terminal. 

Due to the rapidly advancing state of knowledge about power analysis, we 
cannot make conclusive statements about the effectiveness of specific counter- 
measures. Nevertheless we suggest several possible defenses in §5 that may make 
power attacks more difficult and raise the level of effort and expertise required 
of the attacker. 

2 Background 

More and more cryptographic systems are embedding keys in portable electronic 
modules such as smartcards and PC cards. These modules usually provide both 
storage for the key and processor power sufficient to allow the key to be used 
in situ, so that the key is never exposed to the outside world. When the holder
of the module (which we will henceforth assume is a smart card) has a stake in 
keeping the key secret, such modules provide strong, convenient, and inexpensive 
security. 

On the other hand, when the cardholder has an incentive to violate the 
secrecy of the key, protecting the key is a difficult challenge to the system’s 
designer. For example, in the case of stored-value cards, learning the card’s key 
may enable the cardholder to defraud a bank. Since the cardholder has physical 
possession of the card, many avenues of attack are available: 

— The cardholder can subject the card to unusual conditions like out-of-range 
supply voltage, out-of-range clock frequency, extreme temperatures, radia- 
tion, or unusual commands, in order to induce errors. Some errors may di- 
rectly expose keys, while others may produce incorrect cryptographic results 
from which keys can be computed [2]. 

— The cardholder can physically dissect the card and reset protection bits, or 
directly read electrical charges in memory cells, or measure voltages on bus 
traces while sensitive data are passing between memory and processor [6]. 

— While the card is performing cryptographic calculations, the cardholder can 
measure currents, voltages, electric fields, or execution times [4], any of which 
might exhibit correlation with the key being used. 

The current drawn through the card’s power connector (Vcc) is easy to measure
with a digital oscilloscope, and provides much revealing information. For example,
if the smart card uses a hardware multiplier for modular exponentiations
of large integers, each multiplication is visible as a distinct period of increased 
current consumption. The fastest implementations of modular exponentiation 
handle ordinary multiplications differently from squarings; but if this technique 
is used in a smartcard, squarings and nonsquare multiplications can be distin- 
guished in an oscilloscope trace of current consumption. From the order in which 
these operations occur, one can deduce the exponent, which is often a vital secret. 
Almost all processes display a similarly rich variability in current consumption
patterns. In a plot of current consumption versus time during the execution
of a single high-level command, the eye easily discerns several separate compu- 
tational phases, distinguished by mean current consumption and by “fuzziness” 
(short-term variability in current consumption). Because of such variations, the 
16 rounds of a DES encryption are generally easy to recognize as a train of 16 
identical boxcars. (On closer inspection, one finds that boxcars 1, 2, 9, and 16 
are slightly shorter than the rest, due to key schedule idiosyncrasies.) 

Since the current drawn by the smartcard is, at constant voltage, proportional 
to the power consumed by the card, attacks based on current measurements are 
usually referred to as power attacks. We will henceforth refer to power instead 
of current. 

Differential Power Analysis (DPA), developed by Kocher et al. [5], is a power- 
ful extension of these techniques. In a DPA attack on a DES key, the smartcard is 
repeatedly induced to encrypt various plaintexts with the key to be found, while 
digitized “traces” of power consumption are recorded along with the plaintexts 
of the encryptions.¹ When a large number (often on the order of 1000) of traces
and plaintexts have been accumulated, averages of subsets of the traces are
computed and compared in order to test guesses of various key bits. 

Specifically, one sorts the traces into two classes according to conjectured 
values of a particular bit B computed during the encryption, the conjectured 
values being computed from the plaintext and a guess at some subset of key 
bits. If the guessed key bits are correct, the conjectured value of B will match 
the true value of B in all traces, and the mean B = 1 trace will differ significantly
from the mean B = 0 trace wherever bit B is handled. On the other hand, if the 
guessed key bits are wrong, the sorting of traces into subsets is (one expects) 
uniformly random, and no significant difference will be observed. 
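
The sorting-and-averaging step can be sketched in a few lines of Python. This is a noiseless toy illustration, not a description of any real attack tool: the function names are ours, traces are plain lists of samples assumed to be already aligned, and the predicted bits are assumed to have been computed beforehand from the plaintexts and a key guess.

```python
def mean_trace(traces):
    """Sample-by-sample average of equally long, aligned traces."""
    count = len(traces)
    return [sum(samples) / count for samples in zip(*traces)]

def dpa_differential(traces, predicted_bits):
    """Mean of traces with predicted B = 1 minus mean of those with B = 0."""
    ones = [t for t, b in zip(traces, predicted_bits) if b == 1]
    zeros = [t for t, b in zip(traces, predicted_bits) if b == 0]
    if not ones or not zeros:
        raise ValueError("both classes must contain at least one trace")
    return [a - b for a, b in zip(mean_trace(ones), mean_trace(zeros))]

# Toy data: sample 1 of each trace leaks the true value of bit B.
true_bits = [1, 1, 0, 0]
toy_traces = [[5, 10 + b, 7] for b in true_bits]

correct_guess = dpa_differential(toy_traces, [1, 1, 0, 0])
wrong_guess = dpa_differential(toy_traces, [1, 0, 1, 0])
```

With the correct prediction the differential shows a deviation at the sample where B is handled; with a prediction uncorrelated with B it stays flat. Real traces are noisy, which is why many of them are needed.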

Two beautiful virtues of DPA are the following: (1) although the attacker 
makes the assumption that the DES code computes the value B, it is not necessary
to know where that computation occurs; and (2) if chip designers add
random noise to mask power consumption, the attacker can compensate for the 
lower signal-to-noise ratio by increasing the number of traces. On the other 
hand, practical problems in mounting a DPA attack include: (1) protocols may 
be designed to keep the attacker from seeing the plaintext or ciphertext; and (2) 
sometimes it is not possible to get enough traces. 

More recently, Biham and Shamir [1] described a power attack in which many
traces from each of several different cards are compared to identify when key bits 
are being handled. Instants with large same-card power variations are assumed to 
be data-handling, not key-handling, instants. Non-data-handling instants with 
large between-card power variations are assumed to be key-handling instants. 
Once the attacker has located the instructions handling parts of the key, their 
power consumptions can be measured to attack the key. Since its “profiling”
stage can be separated from its key extraction stage, this attack falls into the
class described in the present paper.

¹ Decryptions may be used instead of encryptions; and independently, either plaintexts
or ciphertexts can be used.

3 IPA Attacks 

Here we describe an IPA attack. Stage 1, the profiling stage, contains almost all 
of the effort. Stage 2, the key extraction stage, is then quick and simple. One 
can think of the profiling stage as a long precomputation, after which one can 
obtain each subsequent key with only a small incremental effort. 

The first steps in the profiling stage are familiar from other sophisticated 
power attacks, such as DPA: collecting a large number of power traces, and then 
aligning and averaging the traces; the details of these steps need to be specifically 
tailored to IPA. Next come the two tasks necessary to finding the key: locating 
and identifying the key bits. These steps are described in detail below. 

As our primary example we will consider an IPA attack on DES, since DES is 
not only well-known but is perhaps the most widely implemented of all crypto- 
graphic algorithms. We have successfully performed IPA attacks on DES smart 
card implementations and have extracted DES keys from a single power trace in 
Stage 2 of our IPA attacks. 

3.1 Context and Assumptions 

We assume the following: we have a smart card containing a known cryptographic 
algorithm but we do not know the details of the implementation, i.e., we do not 
possess the source code (knowledge of the source code would enable a far simpler 
attack). Furthermore, we can cause hundreds of executions of the algorithm with
different plaintexts (not necessarily uniformly distributed), and we can record 
the amount of power (current) used at each step within each such execution. 

For the purposes of this exposition, we will assume that the algorithm has
been implemented in a straightforward manner, without introducing elements 
designed to thwart power attacks; in particular, the algorithm execution is a 
deterministic function of the plaintext and key. If the card did employ defenses, 
modified versions of IPA might still work, depending on which defenses were 
used; space does not permit discussion of these modified versions. In §5, we 
discuss some defensive measures that can be used by system implementers. 

These assumptions appear to represent the standard context facing someone 
wishing to attack a smart card. In practice we have successfully performed IPA 
attacks in this context using only moderate resources, e.g., only a few hundred 
power traces and a low sample rate (3.57 million samples per second, or one per 
clock cycle) on an oscilloscope. 



3.2 Stage 1: Profiling 

The goal of the profiling stage is to locate and identify the key bits as they are 
used during the computation. To do this, we often need to learn about other 
aspects of the implementation we are facing: the order of operations, how key
bits are handled, and other details that were decided by the programmer. 

The attacker starts by facing an unknown implementation and then gradually 
learns more and more during the course of the profiling stage, until he finds when, 
and often how and where, the program engages the key bits. At the end of the 
profiling stage, the attacker knows many of the decisions that were made by the 
implementer. 

The next three sections describe the important steps in the profiling stage. 



3.3 Profiling: First Steps 

As the attacker, we first cause the smart card (or other device) to execute its 
cryptographic algorithm a large number of times to obtain a large number of 
traces. In practice, we have found that a few hundred traces usually suffice;
as lower and upper bounds, we estimate that in general the number of traces 
needed will be between 100 and 1000. 

For simplicity of exposition, we will suppose that the executions are all with 
the same key and with varying plaintexts. This is not essential, as the attack 
will work even if the keys vary. Of course, if both the key and the plaintext are 
constant, then we are merely resampling the same data point and the multiple
traces are practically useless. The plaintext need not be either random or
uniformly distributed, but merely non-constant. 

The traces are then averaged together, in order to remove the effects of the 
varying data bits while keeping the effects of the constant key bits. Before aver- 
aging, we must first align the traces so that the power consumptions of every 
operation are matched across all the traces. In practice, we have found that each 
implementation of each algorithm requires a slightly different alignment techni- 
que, and the alignment effort ranges from quite simple to quite cumbersome. 

We now have a single “average” trace containing the average power consumed 
throughout the execution. This represents the average, over all plaintexts, of the 
power consumed using the constant, unknown key.²



3.4 Profiling: Key Location 

The next major task is to locate the key bits within the average trace. We first 
describe the basic procedure, followed by a more mathematical description, and 
then mention some implementation notes.



Basic Description 

Almost all cryptographic algorithms contain repetitive structures, used with
changing pieces of the overall key. For example, in most symmetric ciphers,
repeating rounds use differing subkeys. Public key algorithms such as RSA use
different key bits while repeatedly performing modular multiplications. We call
these repeating structures “rounds”, and the corresponding key bits “subkeys”,
with the understanding that in an algorithm such as RSA, the modular
multiplications play the role of rounds.

² Even if the plaintext is not uniformly distributed, the “data” bits soon become
uniform, due to the randomizing effect of the rounds; for example, even if the plaintext
into DES has its top 50 bits set to one fixed value, the bits in the L and R registers
become uniform after the first couple of rounds.

Suppose there are n rounds, and let Ki denote the subkey used in round i. 

Due to code space limitations on smart cards and other devices, the repeating 
structures are generated by the same source code being executed over and over; 
therefore, each round’s subkey is handled identically. 

The key location proceeds as follows: we chop the average trace into rounds
to obtain traces $R_1, \ldots, R_n$ representing “average round 1”, “average round
2”, and so on. Then we average these together to obtain a single “super-average
round” $\mathcal{R}$, i.e., the average of the average rounds.

Next, we take the difference, for each round i, between $R_i$ and $\mathcal{R}$ to obtain
the “round i difference trace” $\Delta_i$. Finally, we square and then average together
the $\Delta_i$’s. These last few steps are equivalent to computing the variances of the
instruction offsets in the $R_i$’s. The final trace, the mean square of the $\Delta_i$’s,
contains peaks that reveal the key bit locations.

Why does this work? At an intuitive level, note that the first averaging (of
different traces) removed the effects of the data bits but left the effects of the key
bits: the only differences between the $R_i$’s are due to the differences between the
subkeys (see Figure 1 for an example). The second averaging (of different rounds)
removed the effects of the key bits as well, leaving only “code” features. When
we then take the difference between average round $R_i$ and the super-average
round $\mathcal{R}$, the code features cancel out, leaving only the effects of the specific
subkey $K_i$. The subsequent squaring and averaging produces clean peaks at all
subkey bit locations.
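
The key location procedure can be sketched as follows. This Python fragment is a simplified illustration under assumptions the text does not make: all rounds are taken to have the same length and to be already aligned, and the function names are ours.

```python
def mean_trace(traces):
    """Sample-by-sample average of equally long, aligned traces."""
    count = len(traces)
    return [sum(samples) / count for samples in zip(*traces)]

def locate_key_bits(average_trace, n_rounds):
    """Return the mean square of the round difference traces.

    Peaks in the returned list mark the offsets within a round at which
    the power consumption depends on the subkey.
    """
    round_len = len(average_trace) // n_rounds
    # Chop the average trace into the average rounds R_1 .. R_n.
    rounds = [
        average_trace[i * round_len:(i + 1) * round_len]
        for i in range(n_rounds)
    ]
    # Super-average round: the average of the average rounds.
    super_average = mean_trace(rounds)
    # Difference trace for each round, then square and average.
    diffs = [[s - r for r, s in zip(rnd, super_average)] for rnd in rounds]
    return [
        sum(diff[t] ** 2 for diff in diffs) / n_rounds
        for t in range(round_len)
    ]

# Toy example: 4 rounds of 5 samples, identical "code" features in each
# round, plus a one-bit subkey contribution at offset 2.
code_features = [3, 7, 2, 9, 4]
subkey_bits = [0, 1, 1, 0]
toy_average_trace = []
for bit in subkey_bits:
    samples = list(code_features)
    samples[2] += bit
    toy_average_trace.extend(samples)

peaks = locate_key_bits(toy_average_trace, n_rounds=4)
```

In the toy data the only nonzero entry of `peaks` is at offset 2, the sample where the subkey bit was injected. On real traces the non-key offsets are close to zero rather than exactly zero.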

Since we know the algorithm, we know the number of key bits comprising 
each subkey, and therefore we know how many peaks to look for. In DES, we 
look for 48 peaks in the average of the squared $\Delta_i$’s. An example is shown in
Figure 2. 

It is a good idea at this point to observe the distribution of power levels 
at the detected peaks, in order to verify that the peaks represent key bits and 
to determine the power threshold that separates a 0 from a 1 bit value. Since 
the actual key bits are presumably 0 or 1 with probability 1/2, an instruction
that handles a single key bit should exhibit a bimodal distribution, with half 
the probability in each mode. An instruction that handles more than one key 
bit should exhibit the appropriate binomial distribution. Thus the shape of the 
power distribution can reveal the way the key bits are being handled. 
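
One simple way to pick such a threshold for a bimodal distribution is sketched below. This is an assumed heuristic for illustration, not a method given in the text: it sorts the power levels observed at one peak across the rounds and places the threshold in the middle of the widest gap. The function name is hypothetical, and which mode corresponds to a key bit of 1 is left open, since that depends on the implementation.

```python
def split_bimodal(levels):
    """Return (threshold, flags) for power levels observed at one peak.

    The threshold is the midpoint of the largest gap between consecutive
    sorted levels; flags[i] is 1 when levels[i] lies in the upper mode.
    """
    ordered = sorted(levels)
    # Each entry is (gap width, midpoint of the gap).
    gaps = [(b - a, (a + b) / 2) for a, b in zip(ordered, ordered[1:])]
    _, threshold = max(gaps)
    flags = [1 if level > threshold else 0 for level in levels]
    return threshold, flags

# Power levels at one key-bit location in six rounds (made-up numbers).
threshold, in_upper_mode = split_bimodal([10, 11, 9, 30, 31, 29])
```

If an instruction handles several key bits at once, the levels form more than two clusters and this two-way split no longer applies.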

In practice, the straightforward cryptographic implementations we have seen
most commonly on smart cards handle key bits individually, and we have
therefore seen bimodal distributions at the peaks. If key bits are not handled
individually, one can use the resulting binomial distributions to learn the Hamming
weights of key bit groupings; one way to proceed in this case is discussed by
Biham and Shamir [1].

Fig. 1. A section of two round traces $R_i$ and $R_j$. They differ only where a key bit is
being handled, at positions 984 through 986.



Mathematical Description 

Let $T_{i,j}(t)$ denote the power consumed at time t within the i-th round in power
trace j. In general, the power consumed at any time t is a function $f_t$ of some
key bits $k = (k_1, \ldots, k_r)$ and some data bits $d = (d_1, \ldots, d_s)$. If we write $d_{i,j}$
and $k_i$ for the actual bit values of d and k in round i in the j-th trace (k depends
only on the round, not the trace), we have³

$$T_{i,j}(t) = f_t(d_{i,j}, k_i). \tag{1}$$

Assuming that the number of traces m is large enough and the numbers of
dependent key bits r and data bits s are small enough,⁴ we will find that the
power $R_i(t)$ in the average round trace is approximately the same as if we had
averaged over all $2^s$ possible values of d:

$$R_i(t) = \frac{1}{m} \sum_{j=1}^{m} T_{i,j}(t) = \frac{1}{m} \sum_{j=1}^{m} f_t(d_{i,j}, k_i) \approx \frac{1}{2^s} \sum_{d} f_t(d, k_i). \tag{2}$$

Furthermore, taking the average of the $R_i$’s to get the super-average trace $\mathcal{R}$
will have the effect of averaging over all $2^r$ values of k, so that $\mathcal{R}(t)$ is:

$$\mathcal{R}(t) = \frac{1}{n} \sum_{i=1}^{n} R_i(t) \approx \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2^s} \sum_{d} f_t(d, k_i) \approx \frac{1}{2^{r+s}} \sum_{k,d} f_t(d, k), \tag{3}$$

where the second step follows from Equation 2.

³ In all equations we ignore the random power fluctuations due to internal and external
noise. In practice, these effects disappear once we have done the first averaging.

⁴ We would like to see $m > \max\{2^r, 2^s\}$. This is reasonable since typical values might
be $m \approx 300$, and $r, s < 4$.

Fig. 2. Peaks in the mean square of the $\Delta_i$’s in a DES implementation. In this implementation, there are 48
peaks, one for each subkey bit, and they are clustered in 8 groups of 6, for the 8 S-boxes.

Consider an instruction, at, say, time $t_1$, that does not involve any key bits;
then the power consumption function $f_{t_1}$ depends only on the data bits d, and
the value of $R_i(t_1)$ from Equation 2 becomes

$$R_i(t_1) \approx \frac{1}{2^s} \sum_{d} f_{t_1}(d). \tag{4}$$

The important point about Equation 4 is that $R_i(t_1)$ does not depend on i
at all, and is therefore constant among all the rounds and in the super-average
round $\mathcal{R}$, i.e., $\mathcal{R}(t_1) = R_i(t_1)$. So when we take the difference between the average
round i trace, $R_i$, and the super-average round trace, $\mathcal{R}$, we find, at position $t_1$,

$$\Delta_i(t_1) = \mathcal{R}(t_1) - R_i(t_1) = 0. \tag{5}$$

Now consider another instruction, at time $t_2$, which handles some key bits k
as well as data bits d; its power consumption function $f_{t_2}$ then looks like that
in Equation 1, and the values of $R_i(t_2)$ and $\mathcal{R}(t_2)$ look like Equations 2 and 3,
respectively. In the round i difference trace $\Delta_i$ we then find, at position $t_2$,

$$\Delta_i(t_2) = \mathcal{R}(t_2) - R_i(t_2) \approx \frac{1}{2^s} \sum_{d} \left[ \frac{1}{2^r} \sum_{k} f_{t_2}(d, k) - f_{t_2}(d, k_i) \right], \tag{6}$$

which in general is not 0, but some function that depends on the specific values
of the subkey bits $k_i \subset K_i$.

The end result is that the difference traces $\Delta_i$ will be close to zero whenever
key bits are not being handled. Therefore, when we square and average the $\Delta_i$’s,
we find peaks exactly at those times (like $t_2$ in our example) when key bits are
being handled.
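
The argument can be replayed numerically on a synthetic power model. The Python sketch below is only an illustration: the two power functions, the number of data bits, and the list of one-bit subkeys are invented for the example, and the average over data bits is taken exactly over all values of d (the limit that Equation 2 approximates) rather than over measured traces.

```python
from itertools import product

S_BITS = 3                                  # number of data bits s (assumed)
SUBKEY_BITS = [0, 1, 1, 0, 1, 0, 0, 1]      # hypothetical one-bit subkeys, one per round

def f_t1(d, k):
    """Power of an instruction that handles data bits only (k is ignored)."""
    return 2.0 + sum(d)

def f_t2(d, k):
    """Power of an instruction that handles one key bit together with data."""
    return 1.0 + sum(d) + 4.0 * k

def average_round(f, k):
    """R_i(t): average of f over all 2^s values of the data bits d."""
    all_d = list(product((0, 1), repeat=S_BITS))
    return sum(f(d, k) for d in all_d) / len(all_d)

def mean_square_difference(f):
    """Mean over the rounds of the squared difference traces at one instant."""
    rounds = [average_round(f, k) for k in SUBKEY_BITS]
    super_average = sum(rounds) / len(rounds)
    return sum((super_average - r) ** 2 for r in rounds) / len(rounds)

at_t1 = mean_square_difference(f_t1)   # no key bit handled
at_t2 = mean_square_difference(f_t2)   # key bit handled
```

At the data-only instant every average round has the same value, so the mean square difference vanishes; at the key-handling instant it is strictly positive, which is the peak the attacker looks for.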



Implementation Notes 

An actual implementation of an IPA attack may encounter difficulties; here we 
mention a couple of the most common. 

The super-average $\mathcal{R}$ will work as planned only if the average rounds $R_i$
are aligned at use of the key bits. Therefore, in practice, one may need to do
some alignment of the $R_i$’s before averaging. For example, in DES, some rounds
contain more shifts of the key registers than others; therefore the offsets
of instructions at which the key bits are used may differ from round to round,
and this difference must be accounted for before averaging. Depending on the
algorithm, this may mean identifying other features within the round traces; this
is one instance in which the “profiling” can involve learning more than just the
key usage patterns.

Another important point is that the number of spikes in $\Delta_i$ may differ from
the size of the subkey, depending on exactly how the key bits are being handled in 
the (unknown) implementation. However, the number of spikes and the number 
of subkey bits should have a simple mathematical relation. 

At the end of the key location step, we know the times when the power 
consumption depends on the key bits, i.e., we know where to find the key. 
3.5 Profiling: Key Bit Identification

Given the list of key bit locations, we still can’t read off the key, because we don’t 
know which location corresponds to which key bit. This process of determining 
the identity of the key bits is the final step in the profiling stage of an IPA attack. 

The actual method of key bit identification varies greatly with the algorithm 
and the implementation. So rather than give overall rules, we restrict ourselves 
to some general comments on algorithmic features, followed by specific remarks 
about some of the more common algorithms. 

An algorithm’s specification may (or may not) restrict the order in which 
the key bits are used. In RSA, for example, the secret key is used as a modular 
exponent, and the exponentiation uses the bits of the key sequentially. However, 
the programmer’s decisions still affect the order in which the key bits are used: 
one can start from either the most significant bit or the least significant bit of 
the exponent, and the bits can be used one at a time or two at a time. But these 
are a relatively small number of choices, and therefore the key identification step 
for RSA is usually fairly simple. 

In DES, on the other hand, key identification can be much more difficult, 
since there are fewer restrictions on key bit order. A DES subkey consists of 
48 bits used as inputs to 8 S-boxes. The S-box operations within a round do 
not depend on one another, and thus all 8! orderings of S-boxes are possible. 
Furthermore, there is no restriction on the order in which the 6 key bits used 
in an S-box are loaded from the key registers. A DES programmer may in fact 
choose a key bit order not based on any S-box order, but based on, say, the key 
bit locations inside the key registers. 

Because of the large subkey size in DES, we have also run into difficulties 
when we have found fewer than 48 key locations. Suppose we find only 32 key 
locations. Then the key identification step must determine, first, which of the 
possible $\binom{48}{32}$ subsets, and, second, which of the 32! permutations of 
that subset, we are seeing. Of course, in a straightforward implementation, some 
key bit orderings are much more common than others. The S-box order is more likely 
to be 12345678 or 87654321 than 53821467, for example. Still, we caution against 
assuming any particular ordering, and after the most obvious guesses have failed, 
it may be unclear where to turn next. 
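
To get a feeling for the size of this search space, the count for the 32-location example can be evaluated directly. The snippet assumes the subset count is the binomial coefficient "48 choose 32"; the variable names are illustrative.

```python
import math

subsets = math.comb(48, 32)          # which 32 of the 48 subkey bits were located
orderings = math.factorial(32)       # in which order those 32 bits appear
total = subsets * orderings          # equals 48! / 16!

print(f"about 2^{math.log2(total):.1f} candidate identifications")
```

A space of this size cannot be searched blindly, which is why structural information about the implementation or the key schedule is needed to narrow the candidates.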

When the key identification becomes non-trivial, one can turn to the key 
scheduling section of the algorithm’s specification, and select patterns that can 
be sought in the empirical key location data. In DES, for example, the key 
schedule specifies the patterns in which individual key bits move from one 
S-box position to another in consecutive rounds. This allows a key identification 
hypothesis to be tested by comparing the movement from round to round of 1’s 
and 0’s in our observed key locations against the movement of fixed key bits in 
the DES key schedule. 
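
A hypothesis test of this kind can be sketched as follows. The PC-1, PC-2, and shift tables are the standard DES key-schedule tables; everything else is an assumption made for illustration. The hypothesis is modelled as a list `perm` that assigns a subkey bit position to each observed location, taken to be the same in every aligned round, and the names `subkey_sources` and `is_consistent` are hypothetical. A hypothesis is rejected as soon as one master key bit would need two different values in two rounds.

```python
# Standard DES key-schedule tables: PC-1, PC-2 and per-round left rotations.
PC1 = [57, 49, 41, 33, 25, 17, 9, 1, 58, 50, 42, 34, 26, 18,
       10, 2, 59, 51, 43, 35, 27, 19, 11, 3, 60, 52, 44, 36,
       63, 55, 47, 39, 31, 23, 15, 7, 62, 54, 46, 38, 30, 22,
       14, 6, 61, 53, 45, 37, 29, 21, 13, 5, 28, 20, 12, 4]
PC2 = [14, 17, 11, 24, 1, 5, 3, 28, 15, 6, 21, 10,
       23, 19, 12, 4, 26, 8, 16, 7, 27, 20, 13, 2,
       41, 52, 31, 37, 47, 55, 30, 40, 51, 45, 33, 48,
       44, 49, 39, 56, 34, 53, 46, 42, 50, 36, 29, 32]
SHIFTS = [1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1]


def subkey_sources():
    """For each of the 16 rounds, list the master key bit number (1..64)
    that feeds each of the 48 subkey bit positions."""
    c, d = PC1[:28], PC1[28:]
    rounds = []
    for s in SHIFTS:
        c, d = c[s:] + c[:s], d[s:] + d[:s]   # rotate both halves left by s
        cd = c + d
        rounds.append([cd[p - 1] for p in PC2])
    return rounds


def is_consistent(perm, observed):
    """perm[j]: subkey position hypothesized for the j-th observed location.
    observed[r][j]: bit value read at the j-th location in round r.
    Returns False if some master key bit would need two different values."""
    sources = subkey_sources()
    seen = {}
    for r, bits in enumerate(observed):
        for j, bit in enumerate(bits):
            master = sources[r][perm[j]]
            if seen.setdefault(master, bit) != bit:
                return False
    return True


def key_bit(key, n):
    """Bit n (1 = most significant) of a 64-bit DES key."""
    return (key >> (64 - n)) & 1


# Simulated observations for an arbitrary example key (assumed data).
KEY = 0x133457799BBCDFF1
observed = [[key_bit(KEY, m) for m in row] for row in subkey_sources()]
identity = list(range(48))
print(is_consistent(identity, observed))  # prints True
```

With real measurements, a few misread bits would make every hypothesis fail a strict test like this one, so a tolerant variant that counts contradictions may be more practical.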

At the end of a successful key identification step, and thus at the end of the 
profiling stage of the attack, we have a table of locations (inside the round traces) 
and the corresponding key bit identity. If we are attacking DES, for example, 
our table might look something like that in Table 1, where the numbers in the 
location column are offsets into the aligned, averaged round traces (Ri) and the 
key indices refer to bits within the round subkeys (Ki). 



Table 1. The final result of the profiling stage of an IPA attack: the key table. 



Location    Subkey Bit 
380         Ra 
672         ki 
1022 



3.6 Stage 2: Fast Key Extraction 

Armed with the key location table, we can easily find the subkey bits and then 
the master key bits from the traces we have. But the profiling data depend only 
on the software implementation that we are attacking, and not at all on the 
key that was used in the traces we processed. Therefore the information in the 
table will be equally valid for all other instances of the same software running 
on identical hardware, and so we can easily find the key in any such instance, 
not just the instance whose power traces we have already recorded. 

In short, the profiling stage need only be done once, and then key extraction 
can be done quickly and efficiently from new instances with unknown keys. One 
can think of the profiling stage as a long precomputation of the key location 
table; after the precomputation we can then quickly solve any similar instance 
of the same problem. For example, given a second smart card, identical to the 
first except for a different key, the data in our key location table immediately 
point us to the new key, without taking hundreds of new traces involving the 
new key. 

To extract the key from a new instance of the same implementation, we take 
a single power trace, chop it into rounds, and measure the power consumed at 
the locations specified in the key location table. Using our knowledge of the key 
bit power distribution, which we obtained during the profiling stage, we can tell 
whether the key bit is a 0 or a 1. 
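
The extraction step can be illustrated with a nearest-mean decision at each tabulated location. This is a sketch under assumptions: the profile is taken to store, for every location, the mean power observed for a 0 bit and for a 1 bit, and the table entries, power values, and function names are invented for the example.

```python
def average_traces(traces):
    """Pointwise average of several traces of equal length."""
    n = len(traces)
    return [sum(t[j] for t in traces) / n for j in range(len(traces[0]))]


def read_key_bits(round_trace, key_table, profile):
    """key_table: list of (location, subkey_bit_index) pairs.
    profile: dict mapping location -> (mean power for bit 0, mean power for bit 1).
    Each bit is decided by whichever profiled mean is closer to the sample."""
    bits = {}
    for location, bit_index in key_table:
        mean0, mean1 = profile[location]
        sample = round_trace[location]
        bits[bit_index] = 0 if abs(sample - mean0) <= abs(sample - mean1) else 1
    return bits


# Hypothetical key table and profile for a 7-sample round trace.
key_table = [(2, 7), (5, 3)]
profile = {2: (3.0, 4.0), 5: (2.0, 3.0)}
trace = [1.0, 2.0, 3.9, 1.0, 0.5, 2.1, 1.0]
print(read_key_bits(trace, key_table, profile))  # prints {7: 1, 3: 0}
```

If one trace is too noisy, several traces can be passed through `average_traces` first and the averaged trace classified in the same way.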

Due to the particularities of the algorithm and the implementation, a single 
power trace may not suffice, in which case we would take, say, 5 traces, average 
them together, and then measure the power levels at the key locations. The 
issue of whether one trace will suffice depends, among other things, on whether 
there are instructions that handle key bits separately from data bits. In most 
of the implementations that we have seen, each key bit is handled by itself in 
at least one instruction (for example, the bit is loaded into a register) without 
the interference of data bits; therefore we have been able to extract keys using 
a single trace. 

In any case, the number of traces needed in the key extraction stage is far 
less than the number needed during profiling, where we needed enough traces to 
average away any effects of the data bits and to ensure that any peaks we found 
were due to key bits only. For key extraction, on the other hand, we already 
know where the key bits are, and only care about whether data bits may affect 
readings at the key bit locations, and can safely ignore the effects of data bits 
elsewhere in the trace. 

4 Strengths of IPA 

There are several aspects of IPA attacks that make them effective in situations 
where DPA attacks would be difficult to mount. 

First we note that in order to mount a DPA attack an attacker needs the 
plaintext (or ciphertext) associated with every trace; but this is not required 
for an IPA attack. Thus one of the important defenses against DPA — protocol 
designs that hide plaintext and ciphertext when master keys are used — is useless 
against IPA. 

Also, a DPA attack is restricted to points in the algorithm where the plaintext 
(or ciphertext) interacts directly with the key; this is because the differential 
traces are based on a “selection function” that predicts a bit value based on a 
small number of plaintext bits and a small number of key bits. In practice this 
usually means that a DPA attack is restricted to the beginning (if plaintext) or 
ending (if ciphertext) of a cryptographic algorithm; for example, DPA attacks 
on DES generally concentrate on either the first or last couple of rounds. 

In contrast, an IPA attack is as capable of looking at the middle of an al- 
gorithm as at the beginning or end. This can be an important advantage in 
cases where some intervening processing is applied to the plaintext before the 
key is directly applied. For example, in a recent DPA attack on the AES can- 
didate cipher Twofish [3], the attackers could only extract a certain “whitening 
key,” after which significantly more analysis (including an exhaustive search of 
9^ possibilities) was needed to derive the master key. An IPA attack, on the 
other hand, can focus its attention on any (or all) of the intervening rounds, and 
thus extract the round keys without any further analysis. 

Another set of advantages of IPA derives from its ability to do fast key 
extraction after a single lengthy profiling stage. This means that the cost of 
the profiling stage can be amortized over many key extractions, thus making 
an IPA attack economically feasible even if the cost of obtaining hundreds of 
power traces is large. In a DPA attack, by contrast, the attacker must collect a 
large number of traces for every key to be broken. If the cost of obtaining those 
traces is greater than the benefit of the key itself, the DPA attack is rendered 
impractical, whereas an IPA attack remains viable. 
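
The amortization argument can be made explicit with a small cost model. The symbols are assumptions introduced only for this illustration: $c$ is the cost of one trace, $N_p$ the number of profiling traces, $N_e$ the number of traces per key extraction, $N_d$ the number of traces a DPA attack needs per key, and $m$ the number of keys to be broken.

```latex
\begin{align*}
C_{\mathrm{DPA}}(m) &= m \, N_d \, c, \\
C_{\mathrm{IPA}}(m) &= (N_p + m \, N_e) \, c, \\
C_{\mathrm{IPA}}(m) < C_{\mathrm{DPA}}(m)
  &\iff m > \frac{N_p}{N_d - N_e} \qquad (N_d > N_e).
\end{align*}
```

Under this model the per-key cost of IPA tends to $N_e \, c$ as $m$ grows, whereas the per-key cost of DPA stays at $N_d \, c$.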

Fast key extraction also overcomes another defense against DPA: a protocol 
may disable a card after only a small number of operations, if the operator does 
not know the secret. Such a protocol can block DPA, but does not block IPA, 
where the many profiling traces can be obtained using a “friendly” card (one 
where we are the legitimate owner and know the secret), and only a small number 
of traces are needed for each “unfriendly” card being attacked. 

5 Defenses 

Here are some suggestions to system implementers trying to protect against this 
class of attacks. 

First, avoid handling key bits one at a time. Some ciphers are more amenable 
to this approach than others. Still, even a bit-oriented cipher like DES can 
sometimes be effectively protected: when a DES master key is inserted, its key 
schedule can be computed once for all time, and stored as six bits in each of 
16 × 8 bytes, ready to be used with no further bit manipulation. For further 
protection, unused bits in any byte may be filled with irrelevant values instead 
of being set to zero. 
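
As an illustration of this storage layout, the sketch below packs one 48-bit round subkey into 8 bytes, with six key bits in the low positions and random filler in the two unused high bits. The function names and the example subkey value are assumptions. For simplicity the reader side masks the filler off, which a real implementation might avoid in some other way.

```python
import secrets


def pack_subkey(subkey48):
    """Split a 48-bit round subkey into 8 bytes: the low 6 bits of each byte
    carry the key bits for one S-box, the top 2 bits are random filler."""
    packed = []
    for i in range(8):
        six_bits = (subkey48 >> (42 - 6 * i)) & 0x3F
        packed.append((secrets.randbits(2) << 6) | six_bits)
    return bytes(packed)


def sbox_key_bits(packed, i):
    """Key input for S-box i (0..7), with the filler bits masked off."""
    return packed[i] & 0x3F


example_subkey = 0x1B02EFFC7072  # arbitrary 48-bit example value
stored = pack_subkey(example_subkey)
print([sbox_key_bits(stored, i) for i in range(8)])
```

Doing this once for each of the 16 rounds yields the 16 × 8 bytes mentioned above.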

Randomize the execution of the code. Where the order of operations is un- 
important, such as S-box evaluation in DES, vary the order instead of using a 
fixed order. Insert random delays, even if only one instruction-time in duration. 
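
Varying the S-box order might look like the following sketch. The names are hypothetical, and `sbox_fn` stands in for whatever routine evaluates S-box `i`; the toy function below is not a DES S-box. The order is drawn from the operating system's random source through Python's `secrets.SystemRandom`, and a card would use its own random number generator instead.

```python
import secrets

_rng = secrets.SystemRandom()


def randomized_order(n=8):
    """Return the indices 0..n-1 in a freshly shuffled order."""
    order = list(range(n))
    _rng.shuffle(order)
    return order


def evaluate_sboxes(inputs, sbox_fn):
    """Evaluate sbox_fn(i, inputs[i]) for every i, visiting the S-boxes in a
    random order; the result does not depend on the order chosen."""
    outputs = [None] * len(inputs)
    for i in randomized_order(len(inputs)):
        outputs[i] = sbox_fn(i, inputs[i])
    return outputs


def toy_sbox(i, x):
    """Stand-in for a real S-box lookup (assumed, for demonstration only)."""
    return (7 * x + i) & 0xF


print(evaluate_sboxes([3, 1, 4, 1, 5, 9, 2, 6], toy_sbox))
```

Because the S-box evaluations within a round are independent, the output is identical on every run while the time at which each key bit is handled changes.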

Randomize the representation of data. Sometimes a quantity can be “blinded” 
by combining it with a randomly chosen constant; for example, value A 
may be maintained as $A \oplus K_1$, value B as $B \oplus K_2$, and $A \oplus B$ computed as 
$(A \oplus K_1) \oplus (B \oplus K_2) \oplus (K_1 \oplus K_2)$. When a single bit must be handled, consider 
representing the bits 0 and 1 as “01” and “10”. 
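
The blinding identity can be checked with a short sketch. The helper names and the example values are assumptions. Note that the final result here is the unblinded A XOR B, as in the formula above; only the intermediate values stay masked.

```python
import secrets


def blind(value, nbits=8):
    """Return (value XOR k, k) for a fresh random nbits-bit mask k."""
    k = secrets.randbits(nbits)
    return value ^ k, k


def xor_of_blinded(a_blinded, k1, b_blinded, k2):
    """Compute A XOR B as (A^K1) ^ (B^K2) ^ (K1^K2).
    Neither A nor B appears unmasked as an intermediate value."""
    combined_mask = k1 ^ k2
    return (a_blinded ^ b_blinded) ^ combined_mask


# Two-bit encoding of a single bit: both code words have Hamming weight 1.
TWO_BIT_CODE = {0: 0b01, 1: 0b10}

a, b = 0x5A, 0xC3  # arbitrary example values
a_blinded, k1 = blind(a)
b_blinded, k2 = blind(b)
print(xor_of_blinded(a_blinded, k1, b_blinded, k2) == a ^ b)  # prints True
```

The two-bit code has the property that handling a 0 and handling a 1 both involve exactly one set bit.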

Limit the number of times a key can be used without confirmation of legitimacy, 
while simultaneously reducing the attacker’s signal-to-noise ratio with 
filters or generators of random noise. The addition of noise will prevent key 
extraction from a captured trace of a legitimate transaction, and the limit on key 
probes will discourage key extraction from a stolen card. 
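
A use limit of this kind amounts to a counter that is reset only when legitimacy is confirmed. The class below is a hypothetical sketch of that logic, not a description of any particular card. On a real card the counter would have to be kept in non-volatile memory and updated before the key is used.

```python
class KeyUseLimiter:
    """Allow at most max_unconfirmed_uses key operations between confirmations."""

    def __init__(self, max_unconfirmed_uses):
        self.max_unconfirmed_uses = max_unconfirmed_uses
        self.unconfirmed_uses = 0

    def confirm_legitimacy(self):
        """Call after the operator has proven knowledge of the secret."""
        self.unconfirmed_uses = 0

    def use_key(self, operation):
        """Run operation() only if the unconfirmed-use budget is not exhausted."""
        if self.unconfirmed_uses >= self.max_unconfirmed_uses:
            raise PermissionError("key use limit reached; confirmation required")
        self.unconfirmed_uses += 1   # count the use before touching the key
        return operation()


limiter = KeyUseLimiter(max_unconfirmed_uses=2)
print(limiter.use_key(lambda: "encrypted"))  # prints encrypted
```

The smaller the budget, the fewer traces an attacker holding a stolen card can collect.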

Although no single defense makes a system impervious to IPA, and new 
attacks can be expected in the future, adding a variety of these countermeasures 
will likely increase the difficulty of IPA attacks, reducing, one would hope, both 
the number of potential attackers and the probability of any given attacker’s 
succeeding. 

References 

1. Eli Biham and Adi Shamir. “Power Analysis of the Key Scheduling of the AES 
Candidates,” Second Advanced Encryption Standard Candidate Conference, Rome, 
March 1999. 

2. Eli Biham and Adi Shamir. “Differential Fault Analysis of Secret Key 
Cryptosystems,” in Advances in Cryptology - Crypto ’97, Lecture Notes in Computer Science 
Vol. 1294, p. 513-525. 1997. 

3. Suresh Chari, Charanjit Jutla, Josyula R. Rao, and Pankaj Rohatgi. “A Cautionary 
Note Regarding Evaluation of AES Candidates on Smart-Cards,” Second Advanced 
Encryption Standard Candidate Conference, Rome, March 1999. 







P.N. Fahn and P.K. Pearson 



4. Paul Kocher. “Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS 
and Other Systems,” in Advances in Cryptology - Crypto ’96, Lecture Notes in 
Computer Science Vol. 1109, p. 104-113. 1996. 

5. Paul Kocher, Joshua Jaffe, and Benjamin Jun. “Introduction to Differential Power 
Analysis and Related Attacks,” 
http://www.cryptography.com/dpa/technical/index.html. 1998. 

6. Oliver Kömmerling and Markus G. Kuhn. “Design Principles for Tamper-Resistant 
Smartcard Processors,” in Proceedings of the USENIX Workshop on Smartcard 
Technology (Smartcard ’99), USENIX Association, p. 9-20. 1999. 




Security Evaluation Schemas for the Public and Private 
Market with a Focus on Smart Card Systems 



Eberhard von Faber 

debis IT Security Services, Rabinstraße 8, D-53111 Bonn, Germany 
e-vonfaber@itsec-debis.de 



Abstract. Even users must have some understanding of the different evaluation 
schemas. They must be able to rate the outcomes they rely on and use the 
opportunities to steer the processes. Some evaluation schemas are designed for 
general purposes, others for specific application contexts. The elements of 
evaluation schemas are introduced first. Then observations about smart card 
evaluations are discussed, demonstrating that the evaluation or approval process 
itself affects the evidence of the assurance and the value of evaluation verdicts. 
In particular, trade-off situations typical of smart card evaluations are discussed. 

1 Introduction 

Modern companies use information technology more and more to support their tradi- 
tional business activities, to offer them in a better way or to more customers. The 
commercial goals of a company can only be reached if the information technology 
operates perfectly. Nowadays information is a critical resource that enables compa- 
nies to succeed in their business. Therefore, many products and systems provide 
security functions exercising proper control of the information. Companies 
and the individuals using such products expect that the sensitive information remains 
private and that unauthorised modifications are detected. 

It is key to know (i) whether the products and systems actually and properly respond 
to the security needs in a specific application context and (ii) whether they provide a 
sufficient level of protection. Customers usually do not wish to rely solely on the 
promises given by the product vendor. Therefore, IT products or systems which 
already exist or which are still in the development stage are to be evaluated by 
independent specialists to prove whether and to what extent the security objectives are met. 
In addition, the developer himself often likes to have some proof and an indication of how to 
improve his solution. 

Security assessments (evaluations) are sub-contracted to independent (third) parties 
(Evaluation Facilities or labs). They have to have the knowledge, expertise and re- 
sources necessary to judge whether the product or system is "secure". The Evaluation 
Facility is eventually expected to pass a verdict. For this, the question the lab 
has to answer must be formulated as precisely as possible. If the question or the definition 
of the Security Target is hazy, the evaluation result will not be very helpful. For that 
reason international evaluation criteria specify requirements for content and presenta- 
tion of the Security Target. 

Too little attention is given to the fact that the evaluation or approval process itself 
affects the evidence of the assurance and the value of evaluation verdicts. The 
evaluation schema, its structure and rules, the criteria and their evaluation methodology 
actually affect the outcome of an evaluation and the information given to the user. 
Different evaluation schemas have been designed for general purposes or for 
specific application contexts. After the elements of an evaluation schema have been 
introduced, trade-off situations typical of smart card evaluations are discussed. Studying 
these examples gives valuable information to individuals, companies and organizations 
using evaluated products. 



1 Evaluation Schemas 

According to almost all security evaluation schemas three parties are involved when 
an evaluation is being carried out: 

□ developer and manufacturer of the product or system, 

□ Evaluation Facility, and 

□ Overseer (Certification Body or Approval Authority). 

The developer and manufacturer applies for a security certificate (to be used in the 
public market) or an approval (for a closed private market). He has to provide infor- 
mation about all the construction details allowing the Evaluation Facility to assess the 
security provided. The Overseer (Certification Body or Approval Authority) monitors 
the evaluation activities, reviews and analyses the evaluation report(s) to assess con- 
formance to the evaluation criteria, the evaluation methodology, and the evaluation 
schema. Finally, the Overseer issues the certificate or the approval. The decision 
made by the overseer is based upon the evaluation report prepared by the Evaluation 
Facility. 

In the public market the developer/manufacturer wants to demonstrate to customers 
that they can have confidence in the security provided by the product or system. The 
certificate issued by the Overseer (Certification Body in this context) confirms that 
the evaluation has successfully been performed according to the criteria. The users’ 
decision whether to use the product is additionally based on information given in the 
Certification Report which contains the non-confidential evaluation results (major 
findings) and perhaps extra guidelines for operational use. 

In the private market the developer/manufacturer wants to get the approval that the 
product or system can be used in the specific application context given. For instance 
in banking applications (like the POS debit electronic cash system which uses the 
eurocheque card in Europe and many smart card based electronic purse systems like 
GeldKarte in Germany) the card issuers require a successful evaluation before the 
component can be sold to the banks or payment processors and used in the payment 
system. 

In fact, approvals are often used for marketing in other markets since they demonstrate 
that the developer/manufacturer is able to provide high-quality products. But unlike in 
public markets, the evaluation results are to be accepted by specific institutions. To 
some extent this changes the security evaluations and the co-operation of the parties 
involved in many ways. 

Regardless of the market or evaluation schema, the Evaluation Facility uses the same 
set of information to perform the assessment. This is visualized in Figure 1. 




[Figure 1 (diagram not reproduced). Boxes: Schema (roles, process, responsibilities); Criteria (principles; assurance components: input, task, output); Methodology (interpretation of criteria; guidelines for evaluators and developers; assessment methods); Evaluation Verdict (requirements: environment, usage; restrictions: directions, conditions).] 




Fig. 1. Evaluation Process as seen by the Evaluation Facility 

The basic input for the Evaluation Facility is as follows: (i) All the information about 
the product’s construction is provided by the developer, who also provides samples of 
the product for penetration and testing. (ii) The Evaluation Facility needs to have a list 
of the security objectives, since the product is checked against the security 
objectives defined for the context given. (iii) The assurance requirements define on 
what grounds evidence can be given that the product meets its security objectives. 

The assurance requirements are defined in the evaluation criteria. One can find them 
in the "Information Technology Security Evaluation Criteria (ITSEC)" [4], in the 
"Common Criteria" (Part 3: Security Assurance Requirements) [3] and in the "Trusted 
Computer System Evaluation Criteria (TCSEC)" [7] as well. When working on the 
different assurance aspects the Evaluation Facility uses guidelines describing the 
evaluation methodology (for instance [5] and [6]). 

The relations between the three parties are described in a document called Evaluation 
Schema. The Evaluation Schema defines the process together with the roles and 
responsibilities of the three parties. Additional regulations may be given in the form of 
assumptions or principles if needed in a specific application context. 



1.1 Public Market (Business to Customer) 

The developer/manufacturer wants to demonstrate to his customers that they can have 
confidence in the security provided by the smart card integrated circuit or another 
product he develops. For that reason he prefers to have an "official" certificate issued 
by an officially recognised Certification Body. The evaluation carried out by 
specialised laboratories (as third parties) gives evidence of the product’s assurance. 
Assurance in turn gives the confidence needed by the customers and users. 

Independently of the application contexts in which the products are used, evaluation schemas 
were set up by governmental organisations and assurance requirements were defined. 
Such national evaluation schemas shall meet general market needs. Here criteria such 
as the "Information Technology Security Evaluation Criteria (ITSEC)" [4] and the 
"Common Criteria" ([1], [2], [3]) are used and international recognition agreements 
were signed. In practice, recognition of certificates (evaluation results) is a problem, 
especially if the Evaluation Facility is not known and the methods of the laboratory 
are not clearly documented in the published Certification Report. 

The evaluation shall give assurance (or trust) that the product meets its security target. 
The corresponding requirements checked during the evaluation are defined in a 
document called Security Target. The result of the evaluation is given in the form of a 
verdict (pass or fail). If needed, directions for the developer or the user of the product 
are given. The criteria set out in the ITSEC or Common Criteria permit the developer 
to define the security functions without any restriction and to choose one out of seven 
evaluation levels representing increasing confidence in the ability of the product to 
meet its Security Target. The evaluation level determines the assurance on the verdict. 
In addition, a minimum strength of the security mechanisms shall be claimed. 

There are clear advantages of having an evaluation schema set up and maintained by 
governments. Perhaps the most important ones are: 

□ independence from influence of manufacturers and service providers, 

□ broad recognition of the evaluation results and the Certificates, and 

□ possibility to incorporate many institutions, organizations, and laboratories espe- 
cially for definition and maintenance of the evaluation methodology. 

Therefore, a high quality is expected. Again, the quality of the assessment depends on 
the knowledge, expertise and market position of the Evaluation Facilities. The 
national bodies responsible for the accreditation of laboratories and the Certification 
Bodies monitoring the evaluations have to have detailed know-how and enjoy a high 
reputation. Otherwise the schema will not be valuable but a burden for the 
laboratories (and the vendors) trying to maintain a high standard. 



1.2 Private Market (Banking Applications) 

Especially in banking applications controlled by specific providers of payment 
systems, card issuers or banking associations, the approach is a little different. For 
example, the use of a smart card in the German GeldKarte is permitted by Zentraler 
Kreditausschuß (ZKA, the common organisation of the German credit sector 
associations) only after successful evaluation of its hardware and software. This evaluation 
must be carried out by a laboratory (as trusted third party) accredited by ZKA. 
The evaluation must show that the ZKA-Criteria ([9] or [10]) are fulfilled. 




Fig. 2. Foundation of Application Security in Smart Card based Payment Systems 

In the late 1980s the banks in Germany began to develop their own Evaluation Schema. 
At the same time governmental organizations developed security evaluation criteria 
[8]. But the banks decided not to participate in the schema, in order to have the freedom to 
control the evaluation process independently according to their specific needs. 

In banking applications the products (for example smart cards) are designed to provi- 
de specific services. The cards are purchased by experts of the banks or their service 
companies. The specifications are not developed by the card manufacturer but by 
security experts commissioned by the banks, their associations or working groups. 
Note that there are approximately 40 million cards in Germany each equipped with 
the same functionality. So, there are a lot of differences compared to other public 
markets. 

As shown in Figure 2, the security in banking applications like the electronic cash 
system which uses the eurocheque card or the GeldKarte in Germany is supported by 
a long-term specification process, security evaluations as well as compliance testing. 
The specification of the application must be subject to an evaluation. It is used again 
when performing the exhaustive compliance testing. 

There are some advantages of having an independent evaluation schema. Perhaps the 
most important ones are: 

□ flexible definition of assessment rules, 

□ possibility to require the analysis of specific attack scenarios and 

□ possibility to require improvements of the products. 

The approval is given if all successful attacks considered are so expensive or difficult 
that the value of the information gathered is less than the expenditure. There is some 
room for interpretation, which in turn introduces flexibility, since the aspects 
(i) attacker’s skill, (ii) attacker’s knowledge, (iii) money and equipment, (iv) time, and 
(v) availability of samples (components) must somehow be combined to yield an 
overall verdict. 

Since Zentraler Kreditausschuß (ZKA), as the Approval Authority for the banking 
applications just mentioned, is held responsible for the security, it is in a position to 
demand the improvement of the system. For instance, some years ago a plan was 
presented for attacking the Data Encryption Standard (DES) using hardware 
especially designed for that purpose [12]. Our company worked out all details and 
presented this information to ZKA [13]. As a result, the German banks decided to move 
to Triple-DES. Note that there are more than 250,000 terminals and about 
40 million smart cards in the field. 



2 Observations 

In the following, examples are discussed that show difficulties in practical evaluations. 
First, they help to understand the peculiarities of different evaluation schemas. Second, 
the examples give indications on how to develop such schemas and on the ability of 
the parties involved to deal with them. 



2.1 Case Study #1: Who Defines the Security Target? 

The Security Target shall define the user's requirements because the product or 
system is checked against it. The evaluation results in turn (verdict, answer to the 
question defined in the Security Target) guide users on whether or not to purchase and use 
the product. Therefore users (knowing the application context), developers (knowing 
the product) and third parties (being familiar with the assessment schemas and 
methodologies) should co-operate to define the Security Target. 

The criteria set out in the ITSEC or Common Criteria permit the sponsor to define the 
security functions without any restriction and to choose the evaluation level. But often 
the Security Target does not exactly meet the user’s requirements. Then the 
evaluation will fail to give exactly the evidence needed by the user. 

The user finds himself in a bad position: He has either to demonstrate to the developer 
that the Security Target does not meet exactly his needs or he has to live without 
having a certified product. Note that only a small percentage of the products have 
been evaluated. 

The Common Criteria [1] allow such requirements to be defined for a set of products all 
intended to respond to the same security needs identified for similar environments. 
Such a set of requirements is called "Protection Profile". A Protection Profile holds 
for a group of products (implementations). Before using such a Protection Profile in 
an evaluation process, it must be evaluated and then filled with all the information 
identifying a special implementation (product). 

As a consequence, a Protection Profile like [14] is a helpful tool for manufacturers 
(when developing their products etc.) and for users (to articulate their security needs). 
Evaluations based on Security Targets which in turn are based on the same Protection 
Profile are expected to yield comparable results. 

Unfortunately, Protection Profiles are often written by the manufacturers and focus on 
specific aspects only [14]. It is therefore up to the users to express their interests clearly. 
This can be done by writing Protection Profiles or by defining similar sets of 
requirements. An outstanding example of the second way is the regulations of the German 
“Digital Signature Act” [15] and its “Digital Signature Ordinance” [16]. Here 
standard assurance requirements are used [4]. But functional requirements are also 
defined for services and components to be provided for a public key infrastructure 
planned to partly replace handwritten signatures. Users must form consortiums to 
define functional requirements, not only assurance requirements. 

2.2 Case Study #2: Logical Versus Physical Security 

The software typically evaluated provides security against hostile access on a 
well-defined interface or on an external channel. The software itself is not subject to an 
attack. Security function and threat agent are well separated. Software provides 
logical security. But in many cases one can neither guarantee nor even assume that the 
attacker is unable to attack the security functions themselves. If the module is in a hostile 
environment and not protected by other means, it can be subject to tampering or other 
ways of influencing its behavior. Physical security is required. 

In the nineteenth century Kerckhoffs stated that secrecy must reside entirely in the 
key. So, it is assumed that an attacker may have complete knowledge of the 
cryptographic algorithm and its implementation. For smart cards this assumption does not 
hold. Especially for hardware the secrecy of design information is important. All 
details about the design and layout facilitate attacks, since modern equipment such as a 
focused ion beam (FIB) can be used to re-wire the chip. If design data are available, 
prior reverse engineering is not required. But Kerckhoffs' assumption does not always 
hold for software either. For instance, the concrete software implementation of a 
cryptographic function may provide measures to avert Differential Power Analysis 
(DPA, refer to chapter 2.6). Knowing the method of data coding or data processing, 
the attacker can refine the attack and perhaps carry it out successfully. So, smart card 
security is also based upon concealing information. If the attacker is able to manipulate or 
intentionally disturb the card’s operation, the construction of the countermeasures in 
hardware (and sometimes in software) must be kept secret. Then the attacker must 
carry out costly reverse engineering. But hiding information has a disadvantage. 
The more experts have analyzed a solution, the higher the assurance of its 
security can be estimated to be. Restricting the availability of information reduces the number of experts 
having the possibility to approve, disapprove or improve the mechanisms considered. 

It seems that the differences between hardware and software measures have not been
thoroughly taken into account. The ITSEC and the Common Criteria and their
evaluation guidelines do not consider the study time needed to prepare the attack. The
ZKA approach distinguishes between first and following attacks.



2.3 Case Study #3: Information Versus Protection 

In fact, all security mechanisms realized in the hardware can be disabled or bypassed
by direct manipulation based on previous reverse-engineering. The hardware developer
may describe the measures built in to counter direct manipulation or
reverse-engineering in the Security Target. This would inform the user about such important
security measures but would disclose the details to the public. If this kind of information is
kept secret, it is up to the Evaluation Facility to assess the measures. But for the same reason it
is difficult to inform the user about the effectiveness of countermeasures against
direct manipulation and reverse-engineering, for instance. Yet card issuers in particular,
who bear the risk of fraud in payment systems, have a legitimate interest in being informed
about the existence and the effectiveness of the security measures built in.

Any information about such details may implicitly reveal which
aspects have not been thoroughly addressed and therefore give hints to possible attacks
(possible weaknesses). And it will inform not only attackers but also competitors. Of
course, security shall not be founded on confidentiality of design details, but not
publishing details often supports security. Evaluators working
for chip manufacturers can appreciate this when they read papers on chip card security
published by experts who do not have the same level of detailed information.



2.4 Case Study #4: Smart Cards as Composite Components 

Smart card hardware and software is developed by different companies. These com- 
ponents are assembled in a particular way. Then security relevant data are injected 
into the card. The process is depicted in Figure 3. 

Different entities are involved when producing a smart card: software development,
chip and mask manufacturing, module manufacturing, card manufacturing (plastic #1
and #2), and personalization in two steps. Even for the smart card chip itself there are
at least two companies providing components.

Hence for practical reasons and to prevent disclosure of details of the design to the
other party, hardware and software are evaluated separately.^ Apart from such
confidentiality aspects each company must take the responsibility for its own
developments or services.

[Figure 3: process diagram; recoverable labels: Application and Software Developer,
Bank (Card Issuer), Mask Manufacturer]

Fig. 3. Process of Producing Smart Cards

The security objectives for a smart card are twofold: 

• Ensure "security" for the card when being in the field. 

• Maintain "security" throughout the development and production process. 

Although many specialists concentrate on security in the field, since the smart card
is delivered into a hostile environment without any security regulations and may be
subject to tampering, security in the development, production and personalization
process is also important. More precisely, the organization of the personalization
process turns out to affect the security functionality to be provided by the smart card.
So, the security objectives against which a smart card component is assessed depend
very much on the application context, which in this case includes the production and
personalization process.

^ Of course, for DPA, for instance, both companies have to co-operate to provide samples for
analysis soon.

Therefore, one has to start with a concept of the smart card supply chain (refer to
Figure 3). It addresses security and should be subject to a separate assessment. From
this, a list of requirements for the individual components (and processes using other
equipment) is derived, including information about the number, extent, rigor, and
depth of the evaluations to be carried out.

In principle, the evaluation criteria used in public markets (ITSEC and Common
Criteria) cover all the aspects of a product's life-cycle (especially: development, system
generation, delivery, configuration, and effectiveness in the field). Nevertheless, a
central authority is required that is responsible for ensuring that all the measures fit
together. Note that there are several components and processes. The published result of an
ITSEC or Common Criteria evaluation (certification report) usually does not contain
enough information to decide whether continual security is guaranteed.

In the case of complicated composite products, evaluation schemas like that of Zentraler
Kreditausschuss (ZKA), as the Approval Authority for the banking applications
mentioned above, are very efficient. Because technology changes rapidly today,
evaluation overhead should be avoided. At the same time, requirements must be
defined early. ZKA is in the position both to approve components and processes (based
on evaluation results provided by the laboratories) and to represent and improve the
"user's requirements" since the payment systems are operated for the banking
community the ZKA belongs to.



2.5 Case Study #5: Security Target Definition for Hardware 

The hardware's countermeasures are often characteristics of the device which cannot
easily be described. Sometimes there are countermeasures that were not designed as such.
Nevertheless, the evaluator has to check whether the device has vulnerabilities. For
example, an attacker may try to cause faulty operations of a smart card processor to
compromise the module's security [17]. It is well-known that RSA keys (Bellcore
attack) or DES keys (Differential Fault Analysis, DFA) can be read out if the attacker
succeeds in causing specific faults by exposing the device to radiation or changing the
environmental conditions in another way.

The principle of the Differential Fault Analysis (DFA) is shown in Figure 4 (last
round of the DES; one S-Box i and the associated lines are considered). The attacker
looks for the key component K(i) but he does not know C(i). Due to a single-bit fault
in $R_{15}$ there are one or two values of i with $I(i)' \neq I(i)$. (The faulty values are
marked with a prime.) Comparing error-free and faulty values one has

$$S_i[I(i) \oplus K(i)] \oplus S_i[I(i)' \oplus K(i)] = O(i) \oplus O(i)' \qquad (1)$$

The unknown constant value C(i) has disappeared.
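The filtering step implied by equation (1) can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used by the laboratory: it uses the standard DES S-box S1 as the S-box i, simulates the unknown constant C(i) as a random 4-bit mask, and the function names (`sbox`, `simulate_pairs`, `key_candidates`) are hypothetical.

```python
import random

# Standard DES S-box S1 (row = outer two bits, column = inner four bits).
S1 = [
    [14, 4, 13, 1, 2, 15, 11, 8, 3, 10, 6, 12, 5, 9, 0, 7],
    [0, 15, 7, 4, 14, 2, 13, 1, 10, 6, 12, 11, 9, 5, 3, 8],
    [4, 1, 14, 8, 13, 6, 2, 11, 15, 12, 9, 7, 3, 10, 5, 0],
    [15, 12, 8, 2, 4, 9, 1, 7, 5, 11, 3, 14, 10, 0, 6, 13],
]


def sbox(x):
    """Map a 6-bit input to the 4-bit S1 output."""
    row = ((x >> 5) << 1) | (x & 1)
    col = (x >> 1) & 0xF
    return S1[row][col]


def simulate_pairs(secret_key, n_pairs, rng):
    """Simulate (I, I', O, O') tuples: one correct and one faulty run each.

    The mask c plays the role of the unknown constant C(i); it is the same
    for the correct and the faulty run of one pair (assumption of this sketch).
    """
    pairs = []
    for _ in range(n_pairs):
        c = rng.randrange(16)
        i_ok = rng.randrange(64)
        i_bad = i_ok ^ (1 << rng.randrange(6))  # single-bit fault
        o_ok = sbox(i_ok ^ secret_key) ^ c
        o_bad = sbox(i_bad ^ secret_key) ^ c
        pairs.append((i_ok, i_bad, o_ok, o_bad))
    return pairs


def key_candidates(pairs):
    """Keep every 6-bit key hypothesis that satisfies equation (1) for all pairs."""
    keys = set(range(64))
    for i_ok, i_bad, o_ok, o_bad in pairs:
        keys = {k for k in keys
                if sbox(i_ok ^ k) ^ sbox(i_bad ^ k) == o_ok ^ o_bad}
    return keys


if __name__ == "__main__":
    rng = random.Random(1)
    print(sorted(key_candidates(simulate_pairs(0x2B, 8, rng))))
```

For a single pair, both K(i) and K(i) XOR I(i) XOR I(i)' always satisfy equation (1), so several pairs with different fault positions are needed to narrow the candidate set.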

There are many environmental conditions which may cause erratic operation. For
instance, our laboratory read out keys by superimposing glitches on the power supply.
Of course, there are rather simple measures that can be implemented in the DES calculation
to prevent this kind of attack. But what are the issues to be checked by the evaluator
when he considers the bare hardware only? According to the security evaluation
criteria mentioned above (ITSEC and Common Criteria) the developer must explicitly
define security functions or mechanisms designed to avert threats.



[Figure 4: block diagram of the last DES round; recoverable labels: L15, R15, R16, L16]

Fig. 4. Differential Fault Analysis: Last Round of the DES

Hardware and software may be evaluated separately to avoid disclosure of design
details. In addition, each company must take the responsibility for its own
developments. But it is often difficult to define a threat for the hardware since the
cryptographic algorithm is realized in the software (outside the Target of Evaluation).
General statements like robustness against failures are hard to check since there are many
ways to affect the chip and the effect of a malfunction caused by an attacker is
difficult to rate without knowing the application context.

But listing the security measures to be implemented in the hardware does not by itself
solve the problem. The smart card must withstand attacks. The analysis of such attack
scenarios requires assessing the suitability, binding and strength of a set of many
security measures and characteristics of the hardware (layout, for example). The latter
are often not even claimed to be security measures. Smart card hardware offering many
state-of-the-art security measures may still have fundamental vulnerabilities. If
such detailed security requirements are defined, then the user (not the hardware developer)
will unexpectedly be designing smart card security. ^

User groups like banking consortiums shall use lists of security measures with care.
They are helpful as a first guidance. But it is again the Evaluation Facility that performs
the detailed investigations. The evaluation schema operated for the users shall ensure
that the skill and knowledge of the laboratories are developed. Lists of attack scenarios and
methods are mandatory.

In the case of logical security it can rather easily be decided whether a solution is
"secure". For example, the effort for an exhaustive key search can be calculated and
then the probability of a successful attack can turn out to be almost zero. The
identification of individual mechanisms often supports this analysis. For hardware the
rating is more difficult. The effort for a successful attack is often significantly smaller.
When evaluating a complex of many mechanisms, characteristics and properties the
verdict can be quite clear. But if one considers individual mechanisms they may be
assessed to be rather weak. So, the result of the strength of functions/mechanisms
analysis is to some extent pre-determined by the complexity of the functions or
mechanisms identified on the Security Target level.

^ In addition, if a solution has been disapproved the developer may trace this back to the
requirements he is responsible for.
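The remark that the effort for an exhaustive key search "can be calculated" can be illustrated with a short sketch. The search rate of 10^9 keys per second is a purely hypothetical figure chosen for the example, and the function names are not taken from the paper.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def expected_search_years(key_bits, keys_per_second):
    """Expected duration of an exhaustive search (half the key space on average)."""
    expected_trials = 2 ** (key_bits - 1)
    return expected_trials / keys_per_second / SECONDS_PER_YEAR


def success_probability(key_bits, trials):
    """Probability that a given number of distinct trial keys contains the right one."""
    return min(1.0, trials / 2 ** key_bits)


if __name__ == "__main__":
    rate = 1e9  # hypothetical attacker capability, keys per second
    for bits in (56, 112):
        print(bits, "bits:", expected_search_years(bits, rate), "years on average")
    # The same budget of 2**56 trials against a 112-bit key:
    print(success_probability(112, 2 ** 56))
```

No comparable closed-form estimate exists for most physical attacks, which is the difficulty the paragraph above describes for hardware.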



2.6 Case Study #6: Differential Power Analysis (DPA)

In July 1998, for instance, the new attack scenario Differential Power Analysis (DPA)
[11] was published.^ Although it was known that an external observable like the
power consumption or radiation contains information about the secrets being processed
by a smart card or any other device, attacks of this kind had not been thoroughly
considered before. In July 1998 our lab read out keys from smart cards using DPA for the
first time.

The principle (using the last round of the DES) is shown in Figure 5. 



[Figure 5: block diagram of the last DES round]

Fig. 5. Differential Power Analysis (DPA): Last Round of the DES

The pseudo code below shows the analysis process. The analysis is carried out for one 
or two bits (denoted by b) of each S-Box (denoted by i). K(i) and I(i) are the values of 
the lines associated with the input lines of that S-Box i. 

locate "time interval" first

procedure start
    choose: bit line b, key hypothesis K(i)
    m = 0
    for (a large number of input values) do
        calculate F(b) = bit b of S_i[ I(i) XOR K(i) ]
        if F(b) = 0 then V(m) = -1 else V(m) = +1
        measure power consumption over time S(m,t)
        m = m+1
    for (each time t in the interval) do
        calculate linear correlation
            between V(m) and S(m,t) giving CR(t)
    check if CR(t) shows significant "peaks"
end start

for (each key hypothesis K(i) & other bit lines b) do
    execute "procedure start"

^ The attack was announced in spring 1998.
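The analysis loop can be tried out on simulated data. The sketch below is an illustration under several assumptions that are not from the paper: the power traces are synthetic (Gaussian noise plus a one-unit contribution of the selected S-box output bit at a single time index), the standard DES S-box S1 stands in for S-box i, and all function names are hypothetical.

```python
import math
import random

# Standard DES S-box S1 (row = outer two bits, column = inner four bits).
S1 = [
    [14, 4, 13, 1, 2, 15, 11, 8, 3, 10, 6, 12, 5, 9, 0, 7],
    [0, 15, 7, 4, 14, 2, 13, 1, 10, 6, 12, 11, 9, 5, 3, 8],
    [4, 1, 14, 8, 13, 6, 2, 11, 15, 12, 9, 7, 3, 10, 5, 0],
    [15, 12, 8, 2, 4, 9, 1, 7, 5, 11, 3, 14, 10, 0, 6, 13],
]


def sbox(x):
    row = ((x >> 5) << 1) | (x & 1)
    return S1[row][(x >> 1) & 0xF]


def pearson(xs, ys):
    """Linear correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def simulate_traces(secret_key, n_traces, n_times, leak_time, bit, rng):
    """Synthetic S(m,t): unit-variance noise plus the leaking bit at leak_time."""
    inputs, traces = [], []
    for _ in range(n_traces):
        i_val = rng.randrange(64)
        trace = [rng.gauss(0.0, 1.0) for _ in range(n_times)]
        trace[leak_time] += (sbox(i_val ^ secret_key) >> bit) & 1
        inputs.append(i_val)
        traces.append(trace)
    return inputs, traces


def dpa_peaks(inputs, traces, bit):
    """For each key hypothesis return (largest |CR(t)|, time index of that peak)."""
    n_times = len(traces[0])
    columns = [[tr[t] for tr in traces] for t in range(n_times)]
    peaks = {}
    for k in range(64):
        v = [1 if (sbox(i ^ k) >> bit) & 1 else -1 for i in inputs]  # V(m)
        cr = [pearson(v, col) for col in columns]                     # CR(t)
        t_best = max(range(n_times), key=lambda t: abs(cr[t]))
        peaks[k] = (abs(cr[t_best]), t_best)
    return peaks


if __name__ == "__main__":
    rng = random.Random(1999)
    inputs, traces = simulate_traces(0x2B, 1500, 6, 4, 0, rng)
    peaks = dpa_peaks(inputs, traces, bit=0)
    ranked = sorted(peaks, key=lambda k: peaks[k][0], reverse=True)
    print("top hypotheses:", [hex(k) for k in ranked[:3]])
```

In this simulation the correct hypothesis shows a clear peak at the leaking time index; repeating the analysis for other bit lines b, as in the pseudo code, helps to separate it from wrong hypotheses that happen to correlate.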

Results measured by our lab are shown in Figure 6. The upper curve is the
co-variance CR(t), the power consumption over time S(m,t) is the curve below.




Fig. 6. Linear correlation (co-variance) CR(t) and power consumption over time S(m,t) 

All manufacturers of smart card hardware and software were immediately requested
by Zentraler Kreditausschuss (ZKA) to add countermeasures against DPA (both in
hardware and software) soon. ZKA has the knowledge and, as the responsible
organization, is in the position to demand such improvement of the system. This again
shows how quickly this evaluation schema can react.



2.7 Case Study #7: Hardware Versus Software 

The design hierarchy of a smart card is shown in Figure 7. Things like the process
technology are not changed very often, but the customer's software may change rather
rapidly. The higher the level the more effort and time must be invested to make a
change. So, one likes to assess the hardware and the application software separately.
Modern smart card hardware is equipped with special security mechanisms like
detectors etc. For the mechanisms to be effective, the software often has to take proper
advantage of them. The mechanisms must be enabled or initialized. Register bits
(flags) signaling a possible attack must be evaluated, interrupts must be served, and
the software must properly respond to such events to bring the card into a secure state.
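The interplay described above can be sketched in a few lines. Everything in this example is hypothetical: the flag names, the bit layout of the status register and the `CardState` class are invented for illustration and do not describe any real smart card chip.

```python
# Hypothetical detector flags in a status register (bit layout is invented).
FLAG_VOLTAGE = 0x01
FLAG_FREQUENCY = 0x02
FLAG_LIGHT = 0x04
ALL_DETECTORS = FLAG_VOLTAGE | FLAG_FREQUENCY | FLAG_LIGHT


class CardState:
    """Minimal software view of the card: enabled detectors and a halt latch."""

    def __init__(self):
        self.enabled = 0
        self.secure_halt = False


def init_detectors(state):
    """The software must enable the hardware mechanisms before relying on them."""
    state.enabled = ALL_DETECTORS


def may_proceed(state, status_register):
    """Evaluate the flags before a sensitive operation; latch a secure state on alarm."""
    if state.enabled != ALL_DETECTORS:
        state.secure_halt = True      # detectors were never switched on
    elif status_register & state.enabled:
        state.secure_halt = True      # a detector signals a possible attack
    return not state.secure_halt


if __name__ == "__main__":
    card = CardState()
    init_detectors(card)
    print(may_proceed(card, 0x00))        # no alarm
    print(may_proceed(card, FLAG_LIGHT))  # alarm: card stays halted afterwards
```

The point of the sketch is that a detector in hardware gives no protection if the software never enables it or ignores its flag, which is why the two components cannot be assessed in complete isolation.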



Application 

Additional Software (Patches etc.) 

System Software (ROM Mask Part 2) 

Firmware (ROM Mask Part 1) 

Hardware Design (Basic Mask Set) 

Process Technology, Libraries, and Parameters
Fig. 7. Design Hierarchy of a Smart Card 

In some cases, it was the evaluator of the hardware who formulated requirements on how
best to use the hardware's security characteristics. Restrictions and conditions were
discovered during the evaluation of the hardware. Or security-relevant software had to
be changed since the characteristics of the memories (especially EEPROM) showed
that a vulnerability might have been introduced. The other way around, it was
the evaluator of the software who discovered that special characteristics of the hardware
are required to maintain security. But of course he could not check whether the hardware
fulfills these requirements.

Such findings must be listed as guidelines for the evaluator of the other component.
Zentraler Kreditausschuss (ZKA), as the Approval Authority, ensured that these lists
are checked before final approval.

These issues could not have been discovered by looking at either hardware or
software only. In addition, it is often not feasible to have such information on a security
target (or requirements) level. In many cases, the guidelines required (restrictions or
conditions) were outcomes of an evaluation. Communication between different
Evaluation Facilities (if needed), mediated by an experienced Approval Authority or
Certification Body, is needed when assessing different components which together
build a secure system.

It is hard to force a company to disclose the necessary details to another company.
Usually, they will not co-operate in order to have one evaluation for a composite
product. In addition, this information often comes from an assessment that the developer
normally did not perform himself. The information published in evaluation
reports is not sufficient. If such information were included, secret
information would be disclosed. Therefore, the evaluation schema must support the technical
communication between the evaluators and force them to look beyond their own targets of
evaluation.



2.8 Case Study #8: Whom Do You Trust? 

Security evaluation criteria like the ITSEC and the Common Criteria are designed to
assess technical measures. But obviously, security cannot be guaranteed by technical
means alone. They have to be supplemented by organizational, personnel and other
measures.

Obviously, if the developer and the Evaluation Facility collude, backdoors and
exploitable vulnerabilities may exist. More generally, it is up to the evaluators to
guarantee assurance. Therefore, laboratories with long-term experience and a good
reputation shall be used.

Many organizations define requirements for the smart card design and production
process. They consider traceability aspects as well as technical measures to prove the
authenticity of the components being approved. Some of those concepts are
reasonable but other ideas go too far. It is always important to consider carefully
before requiring technical measures. In smart card production processes this may
introduce too much overhead and overtax the possibilities of the vendors. One should
focus instead on the hostile actions against the card in the field and on the most risky
actions like (i) the injection of keys and other critical data, (ii) the mechanisms needed
to protect components not yet ready to be issued and (iii) the security provided by
the hardware and software of the card.

So, there is again a trade-off: Technical measures help to reduce the trust needed 
when services are delegated to other parties. But complex technical solutions can be 
too difficult and expensive to realize and may overtax the possibilities of the vendors. 



3 Summary 

Presently, too little attention is paid to the fact that the evaluation or approval
process itself affects the evidence of the assurance and the value of evaluation
verdicts. There are different evaluation schemas designed for different purposes.
After the elements of an evaluation schema were introduced, trade-off situations
typical of smart card evaluations were discussed. Studying these examples gives valuable
information to individuals, companies and organizations using evaluated products:

The criteria set out in the ITSEC or Common Criteria permit the sponsor to define the
security functions. If the evaluations are directed by the developer, then the evaluation
may often fail to give exactly the evidence needed by the user. Users must form
consortiums to define functional requirements, not only assurance requirements.

Software provides security against hostile accesses on a well-defined interface or
external channel. Security function and threat agent are often well separated. Software
provides logical security. Hardware modules with embedded software used in a
hostile environment can be subject to tampering or other ways of influencing their
behavior. Physical security is required. Kerckhoffs assumed that an attacker may have
complete knowledge of the cryptographic algorithm and its implementation. For smart
cards this assumption does not hold. Especially for hardware, the secrecy of design
information is important. But Kerckhoffs' assumption does not always hold for
software either. However, restricting the availability of information reduces the number of
experts able to approve, disapprove or improve the mechanisms considered.

Differences between hardware and software measures have not been thoroughly taken
into account. The ITSEC and the Common Criteria and their evaluation guidelines do
not consider the study time needed to prepare the attack. The ZKA approach
distinguishes between first and following attacks.

In fact, all security mechanisms realized in the hardware can be disabled or bypassed
by direct manipulation based on previous reverse-engineering. According to the
ITSEC and the Common Criteria the countermeasures must be described in the Security
Target. But this would disclose the details to the public. Yet card issuers in particular,
who bear the risk of fraud in payment systems, have a legitimate interest in being informed
about the existence and the effectiveness of the security measures built in.

Different entities are involved when producing a smart card. For practical reasons and
to prevent disclosure of details of the design to the other party, hardware and software
are evaluated separately. Apart from such confidentiality aspects each company must
take the responsibility for its own developments or services. And the organization of
the personalization process turns out to affect the security functionality to be provided
by the smart card. In principle, the evaluation criteria used in public markets (ITSEC
and Common Criteria) cover all the aspects of a product's life-cycle. But sometimes
the security objectives for a smart card component depend very much on the
production and personalization process. A central authority is required that is responsible
for ensuring that all the measures fit together. The published result of an ITSEC or
Common Criteria evaluation (certification report) usually does not contain enough
information to decide whether continual security is guaranteed.

Unfortunately, it is often difficult to define a threat or security objective for the
hardware as required by the criteria. General statements like robustness against failures are
hard to check since there are many ways to affect the chip and the effect of a
malfunction caused by an attacker is difficult to rate without knowing the application
context. For software it can rather easily be decided whether a solution is "secure".
For hardware the rating is more difficult. The effort for a successful attack is often
significantly smaller. In addition, the result of the strength of functions/mechanisms
analysis is to some extent pre-determined by the complexity of the functions or
mechanisms identified on the Security Target level.

Zentraler Kreditausschuss (ZKA) has the knowledge and is in the position to demand
improvement of the system. The examples given show how quickly this
evaluation schema can react. A plan was presented and worked out for attacking the Data
Encryption Standard (DES). As a result, the German banks decided to move to
Triple-DES. Some years later all manufacturers of smart card hardware and software
were requested by ZKA to add countermeasures against DPA (both in hardware
and software) soon. The criteria were extended.

In some cases, it was the evaluator of the hardware who formulated requirements on how
best to use the hardware's security characteristics. Restrictions and conditions were
discovered during the evaluation of the hardware. The other way around, it was
the evaluator of the software who discovered that special characteristics of the
hardware are required to maintain security. Such findings must be listed as guidelines for
the evaluator of the other component. Therefore, the evaluation schema must support
the technical communication between the evaluators and force them to look beyond
their own targets of evaluation.

Obviously, security can not be guaranteed by technical means alone. They have to be 
supplemented by organizational, personnel and other measures. Many organizations 
focus on technical measures for the smart card design and production process. In fact, 
technical measures help to reduce the trust needed when services are delegated to 
other parties. But complex technical solutions can be too difficult and expensive to
realize and may overtax the possibilities of the vendors.



Abstract. The scheme of a device that should have a simple and reliable
implementation and that, under simply verifiable conditions, should generate a 
true random binary sequence is defined. Some tricks are used to suppress bias 
and correlation so that the desired statistical properties are obtained without 
using any pseudorandom transformation. The proposed scheme is well 
represented by an analytic model that describes the system behaviour both 
under normal conditions and when different failures occur. Within the model, it 
is shown that the system is robust to changes in the circuit parameters. 
Furthermore, a test procedure can be defined to verify the correct operation of 
the generator without performing any statistical analysis of its output. 

Keywords: True random number generators, noise, cryptography, tests for 
randomness. 



1 Introduction 

Cryptographic systems should use only true random number generators for producing
keys and other secret quantities. This paper aims at defining the scheme of a true
random number generator that has a simple and reliable implementation and is not
expensive in production. To ensure all these features, the generator must be able to
stand large tolerances in its components without any calibration or compensation.
Furthermore, possible malfunctions must be foreseen and tests to be made during
prototype development, production and (possibly) operation must be defined. Since
the generator is designed for cryptographic applications, the random source it uses
must be suitable to be constructed in a protected and insulated environment. In this
way the device can be certified to work under general and heavy operating conditions.

A popular way of generating truly random binary sequences is to sample analog
white noise after it has been quantized by means of a comparator. Because of offsets
and bandwidth limitations, the generated sequence is typically affected by bias and
symbol correlation, but some tricks are used to suppress both. The bias is eliminated
by sending the quantized signal into a binary counter before sampling it, whereas the
bit correlation is kept under a fixed value by choosing a suitably low sampling
frequency [1-5]. Therefore, in this kind of generator, defects in the bit statistics are
not masked (e.g. by means of a pseudorandom transformation) but simply suppressed.
This can be considered the most correct solution since the device should generate a
sequence whose entropy is the maximum possible, not a sequence whose entropy
looks like the maximum possible. In certification testing one is thus forced to
conclude by an analysis of the scheme that, if the output sequence looks random, i.e.,
if it passes the statistical tests, it is actually random.

The generator proposed in this paper (see Fig. 1) follows this scheme, but its
peculiarity is that the input noise is sampled and held. This solution ensures that the
input noise does not change its value during the comparator response time so that the
devices in the successive stages can operate under the conditions they are designed for
[3]. The proposed scheme is then well represented by an analytic model that describes
the device behaviour both under normal conditions and in the presence of different
failures. In this way the system's insensitivity to changes in the circuit parameters can
be evaluated. Within the same model, a test procedure can be defined to verify the
correct operation of the circuit without performing any statistical analysis of its
output. It is shown that, if the random source is shielded (so that no external signal is
injected) and does not sustain self-oscillations, the circuit operation can be tested by
simply counting the transitions of an internal signal.




The rest of the paper is organized as follows. In Section 2 each of the blocks that
constitute the circuit is described and its role is explained. Furthermore, the generator
self-testing procedure is proposed. In Section 3 an analytical model of the circuit is
sketched and the autocorrelation function of the binary counter output, i.e., of the
signal to be sampled for obtaining a binary random sequence, is given. Results of
numerical simulations, which are in good agreement with the model, are also
reported. A criterion for choosing the output sampling frequency, based on the form
of the autocorrelation function, is then proposed. Some instructions for the practical
design of the generator are given in Section 4 and conclusions of the work are
presented in Section 5. The details of the calculation of the autocorrelation function
are described in Appendix A and some numerical results supporting the self-testing
procedure are reported in Appendix B.



2 Scheme of the Circuit 



Our scheme uses a gaussian white noise source, e.g. shot noise in a directly polarized 
semiconductor junction. Shot noise is completely controlled by the polarization 
current, but its amplitude is typically very low and must therefore be strongly 
amplified. Since a high gain is required, some caution must be taken in the amplifier 
design so that external disturbances are shielded and coloured noises are not added 
[6]. In Fig. 1 the amplified real noise generator is represented by an ideal noise
generator connected in series with a low-pass filter, whose cutoff frequency $\nu_0$
represents the bandwidth limitations of the real generator.

The sampling and holding operation ensures that the comparator works correctly
and makes it possible to sample the binary counter output in a synchronous way. All the
statistical defects that could appear in the output binary sequence if it were generated
by sampling an unstable signal are therefore avoided. It will be explained in the
following how the holding time, i.e., the period of the clock Ck1, must be chosen for
this purpose. Details of the sample-and-hold circuit will not be examined because it is
well known that such devices, operating up to some GHz, can be implemented in a
simple and economical way.

To obtain simple analytic results, in the following the sampled noise that enters the
comparator is supposed to be white, i.e., uncorrelated. This hypothesis is reasonable,
since the sampled noise correlation is fixed by the filter bandwidth and by the input
sampling frequency, i.e., the frequency of Ck1. For instance, if $x(t)$ is the signal
obtained by means of a first order Butterworth filtering [7] of white noise, its
autocorrelation function is, see e.g. [8],

$$\frac{\langle x(t)\,x(t+\tau)\rangle}{\langle x^2(t)\rangle} = \exp(-2\pi\nu_0|\tau|), \qquad (1)$$

where brackets denote statistical average. If the input sampling frequency is $\nu_1$, the
correlation between two consecutive samples of $x(t)$ is

$$\exp(-2\pi\nu_0/\nu_1) \qquad (2)$$

and is controlled by the ratio of the two frequencies.
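Expression (2) is easy to evaluate numerically. The following sketch only restates the formula and its inverse; the function names and the example target of 10^-3 are assumptions made for illustration, not values from the paper.

```python
import math


def sample_correlation(cutoff_hz, sampling_hz):
    """Correlation between consecutive samples, expression (2)."""
    return math.exp(-2.0 * math.pi * cutoff_hz / sampling_hz)


def min_bandwidth_ratio(max_correlation):
    """Smallest ratio cutoff/sampling frequency that keeps (2) below the target."""
    return -math.log(max_correlation) / (2.0 * math.pi)


if __name__ == "__main__":
    for ratio in (0.5, 1.0, 2.0):
        print(ratio, sample_correlation(ratio, 1.0))
    print(min_bandwidth_ratio(1e-3))
```

Because the dependence is exponential, a noise bandwidth only slightly larger than the input sampling frequency already makes the correlation between consecutive samples very small, which supports treating the sampled noise as white.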

The comparator converts the analog noise into a binary signal. Comparators
with hysteresis are generally used to obtain a fast response time. Notice that the
comparator is supposed to be the slowest circuit component, so that its response time
determines the whole system operating frequency. Using current technologies, this
is often the case.

The binary counter ensures that its output takes on both its possible values for the
same average time^, even if its input is biased because of the offsets introduced by the
comparator and by the sample-and-hold [9]. An alternative way of eliminating bias is
to control the comparator threshold by means of a feedback loop, see e.g. [10].
However, it is well known that this solution may introduce some degree of correlation
in the output bits [2]. Furthermore, the feedback circuit is critical and requires
accurate calibration, which is not needed in our scheme.

The DFF (delay flip flop) samples the binary counter output at times corresponding
to the edges of the clock Ck2^ and generates the required binary sequence. The N
counter produces Ck2 as a submultiple of the clock Ck1 at which the input noise is
sampled. N is chosen to keep the output bit correlation lower than a fixed value.

Since Ck2 is synchronous with Ck1 by construction, if the period of Ck1 is larger
than the comparator response time^, it is ensured that the binary counter output is
sampled when it is in a stable state. Any effect due to threshold offset, asymmetry in
saturation output voltages and in rising/falling times, threshold dependence upon the
state of the device and bandwidth limitation of the components is therefore avoided.
These effects are very insidious, since they cause fluctuations of the time required by
the binary counter output for crossing the DFF threshold and can reintroduce in this
way a new bias to the produced bits [3]. In fact, as long as the comparator response
time is small enough, both the binary counter and the DFF work on the usual binary
signals they are designed for, so that the behaviour of these devices should be
extremely reliable.

^1 Corresponding to the average time between two transitions of the comparator in the same
direction.

^2 Notice that the output sampling may be triggered indifferently by negative or positive edges
of Ck2.

^3 Response times of the following stages are supposed to be negligible with respect to the
comparator response time.

On the other hand one can be persuaded that an increase in the comparator response
time, as well as any offset and any decrease in the amplifier gain and bandwidth, can
be detected. In fact, while making the output statistics worse, all these effects result
in a decrease of the number of circuit internal transitions. In Appendix B it is shown
by numerical results that such a decrease is noticeable before the output statistics is
substantially damaged. Counting the internal transitions can therefore be a simple
self-testing procedure for the generator. In Appendix A the expected number of
transitions during a given time interval is calculated under ideal conditions. If the
counted number shows a significant departure from this expected value, it is
reasonable to suspect that some circuit component is faulty enough to spoil the
statistics of the produced bits, which consequently have to be discarded.
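The self-testing idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the circuit's actual test logic: the noise is taken to be white and standard normal, the comparator thresholds are symmetric (±`half_band`), the count is of comparator transitions (the binary counter makes half as many), and the acceptance window of `n_sigma` binomial standard deviations is a hypothetical choice. The function names are invented for this sketch.

```python
import math
import random


def comparator_transitions(samples, half_band):
    """Count state changes of a comparator with thresholds +/- half_band."""
    state = 1  # arbitrary initial comparator output (+1 or -1)
    count = 0
    for x in samples:
        if state == 1 and x < -half_band:
            state, count = -1, count + 1
        elif state == -1 and x > half_band:
            state, count = 1, count + 1
    return count


def self_test(count, n_steps, p_ideal=0.5, n_sigma=6.0):
    """Accept the window if the count lies within n_sigma of the binomial mean."""
    mean = n_steps * p_ideal
    sigma = math.sqrt(n_steps * p_ideal * (1.0 - p_ideal))
    return abs(count - mean) <= n_sigma * sigma


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed, for illustration only
    n = 100_000
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for half_band in (0.0, 0.5):  # ideal comparator vs. degraded one
        c = comparator_transitions(noise, half_band)
        print(half_band, c, self_test(c, n))
```

A window that fails the check would be discarded, in line with the procedure suggested in the text; the degraded case is modelled here simply as a wider hysteresis band.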



3 Model of the Circuit and Output Correlation 



The amplified noise x(t) is assumed to be a stationary and ergodic stochastic process
and the random variable x_n represents the value sampled at the instant t_n and held
until t_{n+1}. The comparator output during this interval, if there is no hysteresis and the
threshold value is 0, can be defined as

\[ y_n = \operatorname{sign}(x_n) = \begin{cases} +1 & \text{if } x_n \ge 0 \\ -1 & \text{if } x_n < 0 \end{cases} \tag{3} \]

This transformation is known in literature as hard limiting or clipping [11]. Here
the value -1 is chosen instead of 0 so that ⟨y_n⟩ = 0 means that no bias occurs. This
happens if there is no offset, i.e., the comparator threshold coincides with the sampled
noise mean value, ⟨x_n⟩ = 0. The following calculations are made under this
hypothesis, which will be discussed at the end of this section. If the clipped noise
produced by the comparator is unbiased, its autocorrelation function is

\[ R_y(k) = \langle y_n \, y_{n+k} \rangle . \tag{4} \]

The sampled noise is supposed to be δ-correlated, that is R_x(k) = R_x(0) δ_{k,0}, where
δ is the Kronecker symbol. As stated in the previous section, this hypothesis is not
critical. In Appendix A it is shown that, as long as the comparator shows no
hysteresis, R_y(k) is δ-shaped too.

The binary counter output, denoted by z_n, takes on the values ±1. For the very
nature of this device, ⟨z_n⟩ = 0 and this result holds even if there is any offset in the
previous stages, causing ⟨y_n⟩ to differ from zero. The binary counter output
autocorrelation function is

\[ R_z(k) = \langle z_n \, z_{n+k} \rangle . \tag{5} \]

If no hysteresis is present, calculation of this function (see Appendix A) yields the
result

\[ R_z(k) = 2^{-|k|/2} \cos\!\left( \frac{k\pi}{4} \right) . \tag{6} \]

It must be remarked that, after passing through the binary counter, the noise is no
longer δ-correlated.

^4 This is not true for periodic disturbances, which are suppressed by a careful circuit shielding.

When the comparator shows hysteresis, the relation (3) becomes

\[ y_n = \begin{cases} \operatorname{sign}(x_n - x_u) & \text{if } y_{n-1} = -1 \\ \operatorname{sign}(x_n - x_d) & \text{if } y_{n-1} = +1 \end{cases} \tag{7} \]

where x_u and x_d are two different threshold values and x_u > x_d. As discussed
in Appendix A, the calculation of R_y(k) and R_z(k) is connected to the problem of
counting the noise zero crossings, which in presence of hysteresis is usually
considered difficult [1]. Nevertheless for discrete time evolution analytic results can
be obtained if thresholds are symmetric with respect to the noise mean value, i.e., if
x_u = -x_d. In this case, since the used input noise distribution p(x) is symmetric too,
the probability of a comparator state change at any time step does not depend upon
the change direction and it is given by

\[ p = \int_{x_u}^{\infty} p(x)\,dx = \int_{-\infty}^{x_d} p(x)\,dx < \frac{1}{2} . \tag{8} \]

In Appendix A the result

\[ R_y(k) = (1 - 2p)^{|k|} , \tag{9} \]

which shows that hysteresis provides the comparator output with memory even if
the input noise is white, is obtained. Furthermore in Appendix A it is shown that

\[ R_z(k) = [r(p)]^{|k|} \cos[k\,\theta(p)] , \tag{10} \]

where

\[ r(p) = \sqrt{(1-p)^2 + p^2} \tag{11} \]

(notice that 0 < r(p) < 1) and

\[ \theta(p) = \arctan\frac{p}{1-p} . \tag{12} \]

Eq. (10) shows that the envelope of R_z(k) decays exponentially for any value of
the probability p. In particular, the fastest possible decay takes place for p = 1/2, i.e.,
when no hysteresis is present and Eq. (10) reduces to Eq. (6).
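As a small numerical companion to the result above, the following Python sketch evaluates the counter output autocorrelation in the polar form R_z(k) = [r(p)]^|k| cos[k θ(p)], with r(p) = sqrt((1-p)^2 + p^2) and θ(p) = arctan(p/(1-p)), which is the polar representation of Re{[(1-p) + ip]^k} derived in Appendix A. The function names are chosen for this illustration only.

```python
import math


def r(p):
    """Modulus of the complex number (1 - p) + i p."""
    return math.sqrt((1.0 - p) ** 2 + p ** 2)


def theta(p):
    """Argument of the complex number (1 - p) + i p."""
    return math.atan2(p, 1.0 - p)


def R_z(k, p):
    """Counter output autocorrelation at lag k for change probability p."""
    k = abs(k)
    return r(p) ** k * math.cos(k * theta(p))


if __name__ == "__main__":
    for k in range(0, 9):
        # p = 1/2 corresponds to a comparator without hysteresis
        print(k, round(R_z(k, 0.5), 6), round(R_z(k, 0.46), 6))
```

For p = 1/2 the modulus is 1/sqrt(2) and the angle is π/4, so the envelope halves every two lags, which is the fastest decay mentioned in the text.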





Fig. 2. Analytical form (continuous line) and numerical values (circles) of R_z(k) without
hysteresis (left) and with hysteresis (right). In the latter case the threshold values are ±0.1



The circuit behaviour has been numerically simulated by means of the Simulink
software. Gaussian white noise with standard normal distribution has been used and
R_z(k) has been estimated as a time average using 800000 samples of z_n. The plot on
the left in Fig. 2 shows the result of a simulation where no hysteresis is present,
together with the theoretical curve (6), whereas the plot on the right shows the result
of a simulation with x_u = -x_d = 0.1^5, together with the theoretical curve (10). In the latter
case the value of p is

\[ p = \frac{1}{\sqrt{2\pi}} \int_{0.1}^{\infty} e^{-x^2/2}\,dx \approx 0.46 . \tag{13} \]



In both figures the agreement between theoretical values and numerical data
(represented by circles) looks good. Indeed, the r.m.s. difference is about 10“^.
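The kind of simulation described above can be reproduced approximately with a short Python sketch. It is not the authors' Simulink model: it assumes white standard normal noise, a comparator with symmetric thresholds ±`half_band`, and a binary counter that toggles on every rising transition of the comparator. The theoretical curve is taken as [r(p)]^k cos[k θ(p)] with p the upper-tail probability of the threshold; all function names are illustrative.

```python
import math
import random


def change_probability(half_band):
    """Upper-tail probability of a standard normal beyond half_band."""
    return 0.5 * math.erfc(half_band / math.sqrt(2.0))


def simulate_counter_output(n, half_band, seed=0):
    """Return n samples (+1/-1) of the binary counter output z_n."""
    rng = random.Random(seed)
    y = 1  # comparator state
    z = 1  # binary counter state
    out = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        if y == -1 and x > half_band:
            y = 1
            z = -z  # the counter toggles on rising comparator edges only
        elif y == 1 and x < -half_band:
            y = -1
        out.append(z)
    return out


def autocorrelation(z, k):
    """Time-average estimate of <z_n z_{n+k}>."""
    n = len(z) - k
    return sum(z[i] * z[i + k] for i in range(n)) / n


def theory(k, p):
    """[r(p)]^k cos(k theta(p)) for a non-negative lag k."""
    modulus = math.sqrt((1.0 - p) ** 2 + p ** 2)
    return modulus ** k * math.cos(k * math.atan2(p, 1.0 - p))


if __name__ == "__main__":
    for half_band in (0.0, 0.1):
        p = change_probability(half_band)
        z = simulate_counter_output(200_000, half_band, seed=1)
        for k in range(0, 7):
            print(half_band, k, round(autocorrelation(z, k), 4), round(theory(k, p), 4))
```

With a few hundred thousand samples the estimated and theoretical values should differ only by the statistical error of the time average, which shrinks roughly as the inverse square root of the number of samples.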

The form of R_z(k) provides us with a criterion for choosing the output sampling
frequency. If a bit correlation lower than ε is required, the minimum value k_ε such
that

\[ [r(p)]^{k} \le \varepsilon \quad \forall\, k \ge k_\varepsilon \tag{14} \]

has to be determined. k_ε is the optimal ratio of the input sampling frequency to the
output one and therefore the value N = k_ε must be chosen for the N counter.^6

^5 Notice that thresholds are measured in units of the noise mean amplitude.

Throughout the calculations no offset has been supposed. If this were the case, the
comparator output would be unbiased and the binary counter would not be needed at
all. The analytical study of the correlation becomes difficult and cumbersome if offset
is taken into account, but the results found here under simplifying hypotheses allow a
conservative estimate of the output sampling frequency even in real circumstances.

Consider indeed a comparator affected by the offset s with thresholds s ± x_u. For
a given input noise this device shows a larger transition rate with respect to a
comparator with no offset and thresholds ±x_0, where x_0 = |s| + x_u. An intuitive
explanation can be gained by looking at Fig. 3, where the case s > 0 is represented
and x(t) is shown instead of its samples.




Fig. 3. Crossings of thresholds affected by offset (dots) and of broader thresholds with no offset 
(squares) by the same input noise 

A smaller transition rate causes a slower decay of the correlation. Therefore a
conservative estimate of the output sampling frequency can be obtained by
considering the correlation calculated for the larger hysteresis band defined above to
include offset.
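The comparison behind this conservative estimate can be checked numerically. The sketch below is an illustration under assumptions not taken from the paper: white standard normal noise and the hypothetical values s = 0.3 for the offset and x_u = 0.1 for the half width of the hysteresis band. It counts comparator transitions for thresholds s ± x_u and for the broadened, offset-free thresholds ±(|s| + x_u) on the same noise record.

```python
import random


def count_transitions(samples, lower, upper):
    """Count state changes of a comparator with thresholds lower < upper."""
    state = 1  # arbitrary initial comparator output
    count = 0
    for x in samples:
        if state == 1 and x < lower:
            state, count = -1, count + 1
        elif state == -1 and x > upper:
            state, count = 1, count + 1
    return count


def compare(offset, half_band, n=100_000, seed=3):
    """Return (transitions with offset, transitions with broadened band)."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    with_offset = count_transitions(noise, offset - half_band, offset + half_band)
    x_0 = abs(offset) + half_band
    broadened = count_transitions(noise, -x_0, x_0)
    return with_offset, broadened


if __name__ == "__main__":
    print(compare(0.3, 0.1))
```

The broadened comparator should show fewer transitions than the one with offset, so a sampling ratio N computed for the broadened band errs on the safe side.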



4 Some Design Instructions 

The designer of a random number generator of the type considered here should take
into account the following set of instructions.

1) The input sampling frequency ν_1, i.e., the clock frequency of the circuit, is
determined by the comparator response time through the condition

\[ \frac{1}{\nu_1} > \text{comparator response time} . \tag{15} \]

^6 N could also be chosen in order to obtain cos[N θ(p)] = 0, but such a condition is more critical
than the one stated in Eq. (14).



2) The correlation of the sampled noise must be negligible with respect to the
correlation introduced by the subsequent stages. If the maximum acceptable value for
the latter is ε, the amplifier cutoff frequency ν_0 must verify

\[ \exp(-2\pi\,\nu_0/\nu_1) < \varepsilon \tag{16} \]

for the filter considered here, or a similar condition for a different filter. Eq. (16)
gives

\[ \nu_0 > \frac{|\ln \varepsilon|}{2\pi}\,\nu_1 . \tag{17} \]

In Appendix B it is shown that a practically white input noise can be obtained even
if ν_0 and ν_1 are of the same order. A similar result is obtained in [9].

3) Once the input noise distribution p(x) has been estimated, the probability

\[ p = \int_{x_0}^{\infty} p(x)\,dx \tag{18} \]

is determined by x_0. This positive quantity has been defined in the previous section
in terms of the actual hysteresis and offset, both measured in units of the noise mean
amplitude. r(p) is then calculated by means of Eq. (11).

4) Finally the condition

\[ N \ge \frac{\ln \varepsilon}{\ln [r(p)]} , \tag{19} \]

which follows from Eq. (14) with k_ε = N, sets the value of N and therefore of the bit
rate

\[ \nu_2 = \frac{\nu_1}{N} . \tag{20} \]

Notice that, once the bit correlation ε has been fixed, ν_2 increases with p, i.e., as
it is intuitive, the bit rate grows as long as offset and comparator hysteresis, which
cannot be totally suppressed, diminish with respect to the noise amplitude.
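The four design steps can be strung together in a short Python sketch. It is a minimal illustration, not a design tool: it assumes standard normal input noise (so that the tail probability is given by the complementary error function), takes r(p) = sqrt((1-p)^2 + p^2), and introduces a hypothetical `margin` factor to keep the clock period above the comparator response time `tau_c`. The function and key names are invented here.

```python
import math


def design(tau_c, eps, x0, margin=0.9):
    """Return sampling frequency, minimum cutoff, p, r(p), N and bit rate.

    tau_c  : comparator response time in seconds
    eps    : maximum acceptable bit correlation (0 < eps < 1)
    x0     : half width of the effective hysteresis band, in noise units
    margin : assumed safety factor (< 1) applied to 1 / tau_c
    """
    nu1 = margin / tau_c                                   # step 1: clock period above tau_c
    nu0_min = abs(math.log(eps)) / (2.0 * math.pi) * nu1   # step 2: cutoff frequency bound
    p = 0.5 * math.erfc(x0 / math.sqrt(2.0))               # step 3: Gaussian tail beyond x0
    r = math.sqrt((1.0 - p) ** 2 + p ** 2)
    n_div = math.ceil(math.log(eps) / math.log(r))         # step 4: smallest admissible N
    return {"nu1": nu1, "nu0_min": nu0_min, "p": p, "r": r,
            "N": n_div, "nu2": nu1 / n_div}


if __name__ == "__main__":
    for x0 in (0.0, 0.1, 0.5, 1.0):
        print(x0, design(tau_c=1e-6, eps=1e-3, x0=x0))
```

For an ideal comparator (x0 = 0) and a correlation bound of 10^-3 the sketch selects N = 20; wider effective hysteresis bands give larger N and therefore lower bit rates.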







5 Conclusions 

The complete unpredictability of the random numbers used by a cryptographic system 
is a necessary condition for the system security that can be satisfied only by means of 
a truly random source. On the other hand, sources of this kind often produce bit 
sequences whose statistics depend in a critical way on details of the implementation. 

The circuit proposed in this paper belongs to a class of true random number
generators that are well known to produce unbiased bit sequences. It is designed to be
as insensitive as possible to fluctuations in the behaviour of the circuit components, so
that neither calibration nor compensation is required. Furthermore, it is satisfactorily
described by an analytical model that gives a relationship between the bit rate and the
maximum expected bit correlation. The model also gives the expected value of the
circuit internal transition rate. Since in our design phenomena that could spoil the bit
statistics also slow down the circuit dynamics, counting the transitions and comparing
their rate to its expected value can be a good self-testing procedure.

An actual circuit that verifies the hypotheses underlying our model generates 
binary sequences whose randomness is ensured by the circuit design. Such a system 
requires a small amount of time for its testing during production, since demanding 
statistical tests can be performed on prototypes only. Furthermore, true randomness of 
the generated bits can be controlled in a simple and effective way even while the 
system is operating. 
Appendix A: Calculation of the Autocorrelation Functions



In the following the probability P_k(l) that the comparator changes its state a number l
of times in the interval [t_n, t_{n+k}] will be needed. l is the number of noise zero
crossings during the considered interval. Under the assumptions of discrete time
evolution, white noise and no offset, if the distribution p(x) of x_n is symmetric (not
necessarily Gaussian), the probability of a comparator state change at any time step
does not depend upon the change direction. Therefore P_k(l) follows a binomial
distribution,

\[ P_k(l) = \binom{k}{l}\, p^{l} (1-p)^{k-l} . \tag{A.1} \]

When the comparator shows no hysteresis,

\[ p = \int_{0}^{\infty} p(x)\,dx = \frac{1}{2} . \tag{A.2} \]



When hysteresis is present, the hypotheses leading to the binomial distribution
P_k(l) given by Eq. (A.1) still hold provided that thresholds are symmetric with
respect to the sampled noise mean value, i.e., x_u = -x_d. In this case the value of p is
given by Eq. (8).

Since the clipped noise is represented by a sign function, its autocorrelation
R_y(k), defined as in Eq. (4), can be given the form

\[ R_y(k) = \sum_{l\ \mathrm{even}} P_k(l) - \sum_{l\ \mathrm{odd}} P_k(l) . \tag{A.3} \]

/even /odd 

It can be proven by simple algebra, using Eqs. (A.3) and (A.1), that

\[ R_y(k) = (1 - 2p)^{|k|} \tag{A.4} \]

(here and in the following, the absolute value of k is used to generalize results to
negative values of k). When no hysteresis is present, Eq. (A.2) holds and therefore

\[ R_y(k) = \delta_{k,0} . \tag{A.5} \]

R_z(k) can be evaluated by means of the probability P_k^{even} that the binary counter
changes its state an even number of times in [t_n, t_{n+k}]. Indeed R_z(k) can be given a
form analogous to Eq. (A.3), which is also equivalent to

\[ R_z(k) = 2 P_k^{\mathrm{even}} - 1 . \tag{A.6} \]

If at the instant t_n the comparator has changed its state an even number of times, in
[t_n, t_{n+k}] every transition of the counter corresponds to two transitions of the
comparator. Therefore in this case the number l of comparator state changes must be
equal to 4m or 4m + 1, where m is an integer such that 0 ≤ l ≤ k, to make the
counter change its state 2m times. On the other hand, if at the instant t_n the
comparator has changed its state an odd number of times, its first transition in
[t_n, t_{n+k}] coincides with the first counter transition. Therefore in this case one less
comparator transition is needed for an even number of counter transitions to occur and
l must be equal to 4m - 1 or 4m.

When there is no hysteresis, it follows from Eqs. (A.3) and (A.5) that the number
of comparator transitions occurred before t_n has the same probability of being even
or odd for every value of n. In presence of hysteresis this is no longer an exact result,
but it is nevertheless a valid approximation, since R_y(k) drops exponentially. In both
cases thus

\[ P_k^{\mathrm{even}} = \frac{1}{2} \sum_{m} \left[ P_k(4m) + P_k(4m+1) \right] + \frac{1}{2} \sum_{m} \left[ P_k(4m-1) + P_k(4m) \right] . \tag{A.7} \]

This result gives Eq. (A.6) the form

\[ R_z(k) = \sum_{l \equiv 0 \bmod 4} P_k(l) - \sum_{l \equiv 2 \bmod 4} P_k(l) . \tag{A.8} \]

Substituting Eq. (A.1) into Eq. (A.8) gives

\[ R_z(k) = \Re\left\{ \left[ (1-p) \pm i p \right]^{k} \right\} , \tag{A.9} \]

where ℜ denotes the real part. This expression is generalized by taking the absolute
value of k and it can be put in the form (10) using the polar representation of
complex numbers. If there is no hysteresis, p = 1/2 and Eq. (6) is obtained.
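The two closed forms obtained in this appendix can be checked against the binomial sums they come from. The Python sketch below evaluates the alternating sum over even and odd l, the sum over l ≡ 0 and l ≡ 2 (mod 4), and compares them with (1 - 2p)^k and Re{[(1-p) + ip]^k}. It only verifies the algebra for non-negative k; the function names are chosen for this illustration.

```python
import math


def P(k, l, p):
    """Binomial probability of l comparator state changes in k time steps."""
    return math.comb(k, l) * p ** l * (1.0 - p) ** (k - l)


def R_y_sum(k, p):
    """Sum over even l minus sum over odd l."""
    return sum((1 if l % 2 == 0 else -1) * P(k, l, p) for l in range(k + 1))


def R_z_sum(k, p):
    """Sum over l = 0 (mod 4) minus sum over l = 2 (mod 4)."""
    total = 0.0
    for l in range(k + 1):
        if l % 4 == 0:
            total += P(k, l, p)
        elif l % 4 == 2:
            total -= P(k, l, p)
    return total


if __name__ == "__main__":
    p = 0.3
    for k in range(0, 10):
        closed_y = (1.0 - 2.0 * p) ** k
        closed_z = (complex(1.0 - p, p) ** k).real
        print(k, R_y_sum(k, p) - closed_y, R_z_sum(k, p) - closed_z)
```

The printed differences should vanish up to floating point rounding, since the binomial expansion of [(1-p) + ip]^k has real part equal to the second sum.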



Appendix B: Number of Internal Transitions vs. Output 
Correlation 

Counting the internal transitions is a good self-testing procedure for the generator
we designed, as long as the increase in output correlation is due to phenomena that
slow down the circuit dynamics and not to periodic disturbances. The connection
between the number of transitions and the output correlation has been confirmed by
further numerical simulations of the circuit in which two different effects have been
separately considered.

The first phenomenon taken into account has been the increase in the comparator
hysteresis, which, in our model, can represent lowering input noise as well as
increasing offset. In each simulation 100000 samples of x_n have been generated for a
fixed value of the hysteresis band half width x_0. Some of the results are shown in
Table 1. Eq. (18), where p(x) is the standard normal distribution, holds for the
probability p and the expected number of binary counter transitions,

\[ \langle N \rangle = 50000\, p , \tag{B.1} \]

is in good agreement with the counted number N.

In Table 1 theoretical and numerical values of R_z(20) are also reported, since
N = 20 can be a suitable value for the N counter. Theoretical values have been
calculated by means of Eqs. (10-12). As the r.m.s. difference between theoretical and
numerical values of R_z(k) is about 5 x 10^-3 in each experiment, simulations can be
considered consistent with the model. Notice that data in parentheses, whose absolute
value is lower than the r.m.s. error, are shown only for the sake of completeness. It
can be seen that a significant increase in correlation occurs when the number of
transitions reduces to about one half of the initial value.

Table 1. Number of internal transitions and output correlation (both expected and numerical)
for different comparator threshold values.

x_0     ⟨N⟩      N        R_z(20) theor.    R_z(20) num.
0       25000    25166    -0.0010           (0.0045)
0.1     23029    23149    3 x 10^-5         (10^-5)
0.5     15515    15463    -0.0021           (0.0006)
0.7     12208    12133    0.0105            0.0204
1       7883     7894     -0.0355           -0.0353


In the second series of simulations the effect of a finite noise bandwidth, i.e., of a
correlated input, has been studied. In each experiment 100000 samples of x_n have
been generated for a fixed value of the frequency ratio ν_0/ν_1, always assuming no
hysteresis, i.e., x_0 = 0. Some of the results are shown in Table 2.

Table 2. Number of internal transitions (expected and numerical) and numerical output
correlation for different cutoff frequencies.

ν_0/ν_1    ⟨N⟩      N        R_z(20) num.
∞          25000    25166    0.0045
0.5        24312    24384    -0.0010
0.1        16044    16071    -0.0058
0.05       11967    11953    -0.0156
0.01       5583     5706     0.1546


In this case the expected value ⟨N⟩, which looks in good agreement with the
numerical value N, is still given by Eq. (B.1), but p has now the form

\[ p = \frac{1}{\pi} \arccos\left[ \frac{R_x(1)}{R_x(0)} \right] = \frac{1}{\pi} \arccos\left[ \exp\left( -2\pi \frac{\nu_0}{\nu_1} \right) \right] \tag{B.2} \]

according to the well known arcsine law [12], assuming first order Butterworth
filtering. It can be seen in this case too that a significant increase in correlation occurs
when the number of transitions reduces to about one half of the initial value.

Notice that only numerical values of R_z(20) are reported in Table 2. Indeed, the
model used throughout this paper for determining the function R_z(k) considers input
white noise. This hypothesis is crucial for the binomial distribution (A.1) to hold. As
the frequency ratio decreases, the model loses its validity and, for ν_0/ν_1 < 0.1, it can
be seen that it no longer accounts for the numerical results. On the other hand,
Table 2 shows how larger values of ν_0/ν_1, e.g. 0.5, do not cause significant
deviations from the ideal case of infinite ν_0. This result confirms that the white noise
hypothesis is not critical.
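The expected transition counts for band-limited noise can be recomputed with a few lines of Python. The sketch assumes the arcsine-law form p = (1/π) arccos[exp(-2π ν_0/ν_1)] for a first order filter and 100000 input samples, so that the expected number of binary counter transitions is 50000 p; the function names are illustrative.

```python
import math


def change_probability(ratio):
    """Sign-change probability between consecutive samples for nu_0/nu_1 = ratio."""
    return math.acos(math.exp(-2.0 * math.pi * ratio)) / math.pi


def expected_counter_transitions(ratio, n_samples=100_000):
    """Half the expected number of comparator transitions in n_samples steps."""
    return 0.5 * n_samples * change_probability(ratio)


if __name__ == "__main__":
    for ratio in (0.5, 0.1, 0.05, 0.01):
        print(ratio, round(expected_counter_transitions(ratio), 1))
```

As the frequency ratio grows the exponential vanishes, p tends to 1/2 and the count tends to the white noise value of 25000.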
Abstract. The strength of a cryptographic function depends on the amount of
entropy in the cryptovariables that are used as keys. Using a large key length 
with a strong algorithm is false comfort if the amount of entropy in the key is 
small. Unfortunately the amount of entropy driving a cryptographic function is 
usually overestimated, as entropy is confused with much weaker correlation 
properties and the entropy source is difficult to analyze. Reliable, high speed, 
and low cost generation of non-deterministic, highly entropic bits is quite 
difficult with many pitfalls. Natural analog processes can provide non- 
deterministic sources, but practical implementations introduce various biases. 
Convenient wide-band natural signals are typically 5 to 6 orders of magnitude 
less in voltage than other co-resident digital signals such as clock signals that 
rob those noise sources of their entropy. To address these problems, we have 
developed new theory and we have invented and implemented some new 
techniques. Of particular interest are our applications of signal theory, digital 
filtering, and chaotic processes to the design of random number generators. Our 
goal has been to develop a theory that will allow us to evaluate the effectiveness 
of our entropy sources. To that end, we develop a Nyquist theory for entropy 
sources, and we prove a lower bound for the entropy produced by certain 
chaotic sources. We also demonstrate how chaotic sources can allow spurious 
narrow band sources to add entropy to a signal rather than subtract it. Armed 
with this theory, it is possible to build practical, low cost random number 
generators and use them with confidence. 



Introduction 

RNGs (Random Number Generators) are hardware and/or software sources that 
supply bits (or numbers) that ideally are statistically independent. In this paper we 
will talk solely about analog RNGs, that is, RNGs whose initial source of entropy is 
analog noise. As such, these RNGs are non-deterministic. In contrast, PRNGs 
(pseudo-random number generators) are deterministic in that their output is 
completely determined from their initial state or “seed”. 

RNGs are used to generate independent bits for cryptographic applications such as
key generation or random starting states, where it is vital that the key or state cannot
be predicted or inferred by the adversary. They are also used as hashing or blinding
factors in various signature schemes, such as the Digital Signature Standard.

Our RNGs have employed either thermal or shot noise, and have been 
implemented in both discrete and integrated forms. Other RNGs have been driven by 
sources such as: vacuum tube shot noise, radioactive decay, neon lamp discharge, 
clock jitter, and PC hard drive fluctuations. 

We further define these RNGs as hybrid RNGs since they comprise an analog 
noise source followed by digital post-processing. The post-processing greatly 
enhances the entropy (statistical independence) of the output, usually at the cost of an 
acceptable reduction in bit-rate. The post-processing could also be termed digital 
nonlinear filtering or lossy compression. Finally, we further class the RNGs as either 
chaotic or non-chaotic. 

Reliable, high speed, and low cost analog random number generation of highly 
entropic bits is a hard problem. This is because practical noise sources are a few 
microvolts while other co-resident, typically fast-transitioning digital signals are 
several volts. Even greatly attenuated interference from these deterministic sources 
can rob the RNG output of most or all of the entropy that a cryptographic application 
may depend on. Amplification and sampling of noise signals can further degrade the 
entropy because the amplifiers are inevitably band-limited, and sampling thresholds 
are typically biased. By applying some results from chaotic processes and signal 
theory we have been able to overcome the problems mentioned above, producing 
reliable, high-speed, low-cost, highly entropic RNGs whose performance is supported 
by strong theory. 

In particular, we study the effect of filtering on sampled analog noise sources. We 
demonstrate that under ideal conditions, a relatively narrow-band noise source can be 
used to produce a perfectly uncorrelated bit-stream. Seeking more practical solutions, 
we demonstrate how simple digital feedback processes can be used to improve RNG 
statistics and to nullify the effects of certain spurious noise sources. Finally, we 
demonstrate how digital feedback, directly interacting with a chaotic amplifier, can 
produce a noise source that coerces other spurious noise sources to contribute their 
entropy to the main source, rather than rob that source of its entropy. We prove a 
lower bound for the amount of entropy per bit that such a chaotic source will produce, 
we calculate the probability density function for the source, and we discuss how to 
use this source to compress n bits of entropy into a vector of length n. 

We believe that the results provided here can help designers include high quality, 
stand-alone, non-deterministic RNGs in low-cost crypto-modules and ICs. 



1 Signal Theory Applied to RNGs 



We begin by showing that it is possible to produce very good, random bit sequences 
of completely independent values by carefully filtering and sampling a non-white 
natural noise source. We show how bandwidth limitations reduce to bit-rate 
limitations. A classic natural random number generator is modeled as follows: 
Fig. 1. Classic Random Number Generator



WGN is Stationary White Gaussian Noise that we assume is wide-band and low-power
(such as thermal or shot noise). H is the transfer function of a band-limited
amplifier. C is a comparator or quantizer function, and S is a sampling process that
samples every interval of length τ.

Generally one can find wide-band, low-power white noise sources, but to work
with them considerable amplification is needed. An inevitable limitation of bandwidth
results. The signal x(t) is expressed in terms of the Dirac delta function δ(t):

\[ x(t) = \sum_{k} w(t)\, \delta(t - k\tau) \tag{1} \]

summing over all integers k. The Fourier Transform of x(t) is then

\[ X(f) = \int_{\mathbb{R}} x(t)\, e^{-j 2\pi f t}\,dt = \sum_{k} \int_{\mathbb{R}} w(t)\, \delta(t - k\tau)\, e^{-j 2\pi f t}\,dt = \sum_{k} w(k\tau)\, e^{-j 2\pi f k \tau} \tag{2} \]

Using the Poisson summation formula on the last expression, we get:

\[ X(f) = \frac{1}{\tau} \sum_{k} W(f - k/\tau) \tag{3} \]

For the moment, ignore the effect of the quantizer, and note that if the shape of W is
completely determined by the filtering of the amplifier H, we can arrange that
W(f) = 0 for |f| > 1/τ by selecting the sampling rate to match the amplifier’s rolloff.
In addition, if the amplifier characteristic is equalized so that W(τ/2-f) = W(τ/2+f),
then the right hand side of Equation (3) is a constant, indicating white noise, and
therefore x(t) will be completely uncorrelated. Again, if we ignore the effects of the
quantizer, then by the Gaussian assumption we can conclude that the values x(nτ) are
independent. We have shown that we can apply the Nyquist rule of thumb: “Make the
sampling rate about twice the bandwidth,” and we have shown that we can carefully
filter a stationary Gaussian narrowband noise source to completely eliminate
correlation. Nyquist theory refers to sampling theorems in hybrid analog and digital
systems where the goal is to eliminate the effects of aliasing and to reduce
intersymbol interference. We have shown that it applies just as well to sampled noise
signals where the performance criterion is intersymbol correlation. It is also clear
from the above analysis that if the original noise source is non-white, an equalizer W
can still be designed so that the sum in Equation (3) is constant.

Recall that the effect of the quantizer has thus far been ignored. The theory is much 
more involved for most common quantizers. We will treat one common and useful 
case in the next section. 
2 Practical and Simple Examples

In order to economically manufacture a good natural random number generator, we 
have to use some simpler digital filtering techniques, shooting for less than Nyquist 
precision. We show how simple digital filtering and sampling techniques can reduce 
correlation. Some examples of RNGs with and without digital post-processing can be 
found in Murry [1], Bendat [2], Boyes [3], Castanie [4], and Morgan [5]. 

Referring to Figure 1, we assume w(t) has zero mean with a power spectral density
function W(f), and we use a simple two-pole amplifier with non-Nyquist filtering.
W(f) rolls off from a flat spectrum at f_l and f_h, the lower and upper cutoff frequencies of
the amplifier. Let us also consider the effect of the quantizer function C. Here we
assume that C is an infinite clipper. That is, C assigns the value +1 to a positive
voltage and -1 to a negative voltage. We also assume the comparator has a bias with
an offset voltage Δ. Let x_n = x(nτ). We are interested in the values μ_x = E[x_n] and
ρ_x(i) = E[x_n x_{n+i}], which can be expressed in terms of
ρ_w(τ) = E[w(t) w(t+τ)]. The mean of the process x is a straightforward error
function approximated by μ_x = 0.4 Δ/σ (when the offset is small enough compared to
the signal power), and with the aid of Price’s theorem [6], the autocorrelation function
can be expressed in closed form as:

71 

where 



\[ \rho_w(\tau) = \frac{f_h \exp(-2\pi f_h \tau) - f_l \exp(-2\pi f_l \tau)}{f_h - f_l} \tag{5} \]

For a typical selection of components for a low cost RNG as modeled above, we get 
unsatisfactory mean and correlation values even if we very carefully isolate the RNG 
components from spurious noise sources. Thus we are motivated to use some simple 
digital filtering techniques. First, suppose we follow the sampler in Figure 1 with a 
simple feedback loop, where the analog noise source below contains the sampler: 




Fig. 2. Digital Processing 



We note that if the delay D is one clock cycle then y_n = x_n y_{n-1}, i.e., y_n is the
running product of the samples x_i. After a short period of time, μ_y is going to be
extremely small even if the x_n's are highly biased and strongly correlated. This is
clearly true when the x_n are independent. More generally, we can show that
E[y_n] → 0 almost as quickly as an expression in ρ_x(1) that itself tends to 0,
depending on the characteristics of w(t)’s autocorrelation function, which we assume
is asymptotically well-behaved and monotonically decreasing (as in the case of the two
pole filter we have assumed). The difficulty here is that the autocorrelation
ρ_y(1) = μ_x is unacceptable. Note that the effect of the feedback loop is just to shift
values of lower order statistics to higher order statistics. Now consider the sequence
z_n. We sample the output y_n at a rate f_s' = f_s/r, where f_s is the sampling rate
producing x_n. Let the function D delay the feedback by d clock cycles, then

\[ E[z_n z_{n+k}] = E[(x_{rn}\, x_{rn-d} \cdots)(x_{r(n+k)}\, x_{r(n+k)-d} \cdots)] \tag{6} \]

We choose d to be relatively prime to r. Then when d does not divide k, we see that
there are no duplications in the subscripts in the above expression, and so there are no
symbolic cancellations of the x values, and so ρ_z(k) is the expectation of the product
of a large number of samples of x which grows larger as n grows large. As for the
case when d divides k: If we set m = k/d, then

\[ E[z_n z_{n+k}] = E[y_{rn}\, y_{r(n+k)}] = E[x_{rn+rk}\, x_{rn+rk-d} \cdots x_{rn+d}] \tag{7} \]

Therefore, this correlation is the expectation of the product of rm bits each spaced d
apart. This works very well with a decreasing autocorrelation function for w(t). In
cases where the acf ρ_w(τ) decreases slowly, the value of d should be increased
to compensate. Heuristically, we are taking the expectation of the product of many ±1
values spaced far apart in time. For a typical acf, increasing the spacing will effect an
exponential reduction in expectation, and increasing the number of bits will also cause
an exponential drop. Thus increasing d and r serve to reduce the expectation
synergistically and powerfully. Both of these techniques are novel. With reasonable
assumptions on n(t) and w(t), we can show that |/y(l)| < |b^(p^( 1>^ given B^ : 

k=0 

a,^(f)c!/((k/2)!W“) (9) 

The actual closed form expressions for ρ_x(τ) are difficult to analyze asymptotically.
We use Price’s theorem [6] to calculate the autocorrelation function of the output of
the infinite clipper, and our expressions include a determinant of an autocorrelation
matrix whose entries are values of the autocorrelation function for the process n(t) at
times nτ (see Figure 1 again). For the example where n(t) is flat noise filtered by a
two-pole filter the autocorrelation function magnitude, |ρ_w(τ)|, is eventually
monotonically decreasing, and thus we can estimate bounds for ρ_z(k). Overall, the key
to improving the statistics is selection of the sampling rates S and S’ as we show next.

Thus, we can produce a sequence z_n with zero mean (after a brief transient), and
unnoticeable correlation. In fact, the bound given above also provides a measure of
independence in that |log(1 - E[x_1 x_2 ... x_n])| bounds the average mutual information
between bits of the sequence z_j. Thus, this system approximates a sequence of
equiprobable mutually independent (Bernoulli) samples of events from a sample space
{1, -1} which can be produced using simple components with very modest performance
characteristics.
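The effect of the feedback loop and of the second sampler can be illustrated with a short Python sketch. It rests on assumptions that go beyond the text: the loop is taken to multiply each incoming ±1 sample by the output delayed by d cycles (the ±1 equivalent of an XOR on 0/1 bits), the input bits are independent with a deliberately large bias, and d = 1, r = 3 are arbitrary illustrative values. All names are hypothetical.

```python
import random


def biased_bits(n, p_plus, seed=0):
    """Independent +1/-1 samples with P(+1) = p_plus (mean 2 * p_plus - 1)."""
    rng = random.Random(seed)
    return [1 if rng.random() < p_plus else -1 for _ in range(n)]


def feedback_loop(x, d=1):
    """Assumed loop: y_n = x_n * y_{n-d}, with the delay line preset to +1."""
    y = [1] * d
    for xn in x:
        y.append(xn * y[-d])
    return y[d:]


def mean(v):
    return sum(v) / len(v)


def lag_product_mean(v, k):
    """Time-average estimate of E[v_n v_{n+k}]."""
    n = len(v) - k
    return sum(v[i] * v[i + k] for i in range(n)) / n


if __name__ == "__main__":
    x = biased_bits(300_000, 0.75, seed=5)   # input mean 0.5, heavily biased
    y = feedback_loop(x, d=1)
    z = y[::3]                               # second sampler, r = 3
    print("mean x:", mean(x))
    print("mean y:", mean(y), " lag-1 of y:", lag_product_mean(y, 1))
    print("mean z:", mean(z), " lag-1 of z:", lag_product_mean(z, 1))
```

In this toy setting the loop removes the mean but leaves a lag-one correlation equal to the input mean, while subsampling by r reduces that correlation to the r-th power of the input mean, which is the shift from lower order to higher order statistics discussed above.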

Suppose we use a Western Electric WE-459G noise diode as the WGN source. Its
output is 0.8 pV/sqrt(Hz) with a power spectral density that is flat ± 2 dB from 100 Hz to
500 kHz. We use a comparator with offset of Δ = 10 mV, and an amplifier with voltage
gain of 100 and a flat transfer function from 100 Hz to 10 kHz. The mean will then be
μ_x = 0.05. For a sampler S with frequency f_s = 10 kHz, the two-pole amplifier model
above predicts all covariances between x_n and x_{n+j} for j between 1 and 50 to be on the
order of 10'^ down to 10 Applying the loop model, we “oversample” with f_s = 30 kHz,
and then set the second sampler frequency to f_s' = 10 kHz. Thus r = 3, and we choose a
delay of 256 bits. The output sequence z_j then has zero mean, and all correlations are 0,
except ρ_z(k) ≈ (0.05)^{3m} for k = 256m.

In general, our experience shows that the above structure works extremely well 
under the assumption that we are reasonably faithful to the model, and we are careful 
to isolate the analog source from coupling effects from the digital filter components 
and other on-board components. This latter requirement is either difficult or 
expensive to satisfy, but it turns out that the same techniques mentioned above 
employing a double-loop topology will mitigate the effects of such coupling. 



3 Reducing Coupling Effects by a Double Loop 

Measured statistics on the output of an implemented single-loop RNG showed a small 
mean bias. This result violated the above theory and we attributed this effect to 
coupling from the high-level digital output into the analog noise source. Coupling 
between the digital output and analog input is denoted by δ. Since the digital output is 
5 to 6 orders of magnitude larger than the analog noise levels, the coupling effect will 
be significant in practical designs and will place a bias on the y_n, independent of the 
sub-sampling ratio. This places a fundamental upper limit on the output entropy. 
However, this limit can be surmounted by placing two loops in tandem: 




Fig. 3. Double-Loop Coupling 

The second loop exponentially mitigates the δ_1 coupling effect, and the first loop 
similarly reduces the δ_2 coupling effect. In the first case, we can model the noise 
source and first loop as a single noise source with some mean bias induced by δ_1. The 
second loop will greatly enhance the bit independence as shown in the single-loop 
RNG analyses in the preceding sections. In the second case, the mean bias on the 
noise source output induced by δ_2 will just be treated as another noise source bias by 
the first loop. In fact, we have found that two loops are enough; three or more loops 
produce similar results. Again, this result is novel. 

No mean bias was observed at the output of a dual-loop RNG. We have 
implemented this dual-loop RNG in a single IC that includes fault-tolerance and 
testability features. The performance of this device bears out the theory. 



4 Chaotic RNG 



This section departs from the above theory in that it treats a radically different type of 
RNG termed a chaotic RNG. Due to the great disparity between analog noise and 
digital signal levels, it is difficult (expensive) to ensure that interference of 
undesirable (low entropic) character will not dominate the analog noise source output. 
This dominance would nullify the beneficial effects of the various techniques 
described above. To free ourselves of this constraint, and other constraints imposed 
by other low-entropic interferers such as 1/f noise, we developed a chaotic RNG. The 
chaotic RNG has the advantageous property of accepting all noises, good and bad, 
and extracting their entropies. We discovered this idea by observing that the LSBs of 
high-resolution A/Ds tend to yield independent bits, regardless of the statistical nature 
and amplitude of the “desired” signal being converted. High resolution A/Ds require 
much hardware; this hardware can be sharply minimized by implementing the A/D in 
a loop with a 1-bit quantizer: 







Fig. 4. Chaotic RNG 



The selection of the RHP (right-half-plane) pole and the clock frequency determine 
the loop gain, A, of the “A/D”. A standard, radix-2 A/D can be implemented by setting 
the loop gain at 2 and by setting the “analog noise source” to a fixed voltage. 

Setting A to unity and replacing the “fixed voltage” by a time-varying signal 
implements a sigma-delta modulator. Finally, increasing A to somewhere between 1 and 
2, and replacing the “time varying signal” with an analog noise source creates a chaotic 
RNG. 

Electrical engineers are taught, almost from birth, to avoid poles in the right half 
plane. However, the loop’s negative feedback creates a stable overall response that 
circumvents the RHP instability. We chose an RHP pole design since it provided the 
simplest implementation using discrete parts. In particular, the RHP pole comprises 
an OP-AMP (operational amplifier), capacitor, and a few resistors: 




Fig. 5. RHP Pole Circuit 

Referring back to Figure 4, we can immediately draw up an equation describing the 
evolution of voltage at the OP-AMP’s output. We will call the normalized RHP pole 
output voltage, at sample time nT, b_n. Note that this voltage at time (n+1)T is a linear 
combination of the voltage at time n, b_n, the sign of the voltage at time n, sgn[b_n], 
thermal noise, g_n, and interference, s_n. The first term is the initial state for the RHP 
filter, and the latter three terms are inputs accumulated by the RHP filter during the 
[nT, (n+1)T] period. The RHP pole will increase its initial state over this time period 
by a factor of A. After normalization, the voltage is described as 

$$b_{n+1} = A\,b_n + g_n + s_n - \mathrm{sgn}[b_n] \qquad (10)$$

For convenience, we will often use c_n as shorthand for sgn[b_n]. 
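
The recursion can be explored numerically. The following is a minimal sketch, assuming the normalized update b[n+1] = A*b[n] + g[n] + s[n] - sgn(b[n]) with Gaussian thermal noise; the function name, the default gain of 1.8, the noise level, and the seeded pseudo-random generator standing in for the physical noise source are all illustrative assumptions, not part of the implemented circuit.

```python
import random

def chaotic_rng(n_bits, gain=1.8, noise_sigma=1e-3, interference=None, b0=0.1, seed=1):
    """Iterate b[n+1] = A*b[n] + g[n] + s[n] - sgn(b[n]); return (bits, states)."""
    rng = random.Random(seed)              # reproducible stand-in for thermal noise
    b = b0
    bits, states = [], []
    for n in range(n_bits):
        c = 1 if b >= 0 else -1            # 1-bit quantizer output c_n = sgn[b_n]
        g = rng.gauss(0.0, noise_sigma)    # thermal noise sample g_n
        s = interference(n) if interference else 0.0   # optional interference s_n
        b = gain * b + g + s - c           # RHP pole gain plus negative feedback
        bits.append(c)
        states.append(b)
    return bits, states

bits, states = chaotic_rng(2000)
print("largest |b_n|:", max(abs(b) for b in states))
print("mean of output bits:", sum(bits) / len(bits))
```

With a gain between 1 and 2 and no saturating interference, the state in this sketch stays close to the interval [-1, +1], which is the effect of the negative feedback through the quantizer.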



4.1 What Is Chaos? 

Chaos can be described as a response that grows exponentially larger with time due to 
an arbitrarily small perturbation. A good introductory description to chaos is 
Schuster [7]. Note that its title is “Deterministic Chaos”. In fact, it is the marriage of 
(analog) deterministic chaos with an analog noise source that engenders a potent 
random number generator. 

Mathematically, a positive Lyapunov exponent defines chaos. In discrete time, the 
Lyapunov exponent is defined as the averaged logarithm of the absolute value of gain 
each cycle. For our chaotic RNG, the gain A is constant, so the Lyapunov exponent is 
ln(A) . Since A > 1 , the exponent is positive, verifying chaos. 
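
As an illustrative check of this statement, the sketch below follows two noise-free trajectories that start a small distance apart and averages the logarithm of the per-cycle growth of their separation. The map used (b -> A*b - sgn(b)), the starting point, and the step count are assumptions chosen for the example; the estimate is only meaningful while both trajectories remain on the same side of the quantizer threshold.

```python
import math

def step(b, gain):
    """One noise-free cycle of the loop: gain followed by 1-bit feedback."""
    return gain * b - (1.0 if b >= 0 else -1.0)

def lyapunov_estimate(gain, b0=0.3, eps=1e-9, n_steps=15):
    """Average log growth rate of the separation between two nearby trajectories."""
    x, y = b0, b0 + eps
    rates = []
    for _ in range(n_steps):
        if (x >= 0) != (y >= 0):       # trajectories straddle the threshold: stop
            break
        before = abs(y - x)
        x, y = step(x, gain), step(y, gain)
        rates.append(math.log(abs(y - x) / before))
    return sum(rates) / len(rates)

print(lyapunov_estimate(1.5), math.log(1.5))
```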

4.2 Other Chaotic RNGs 

Bernstein [8] and Espejo-Meana [9] describe two (of many) possible implementations 
of chaotic RNGs. Of the two, Espejo-Meana is the most similar to the implementation 
described here. 



4.3 Why Is Chaos Good for Random Number Generation? 

Chaos guarantees that any noise contribution, no matter how small and how buried in 
deterministic interference, will ultimately significantly affect the output bits since the 
noise’s effect increases exponentially. This means that we can greatly relax the 
isolation requirements on the analog noise source. As long as the deterministic or 
low-entropic interference does not lower the Lyapunov exponent by causing OP-AMP 
saturation, chaotic operation will occur. In fact, we built the above circuit with both 
analog and digital circuitry powered by the same +5V supply (which also powered 
much other digital circuitry). We observed no interference with chaotic operation. 

This RNG employs the simplest possible topology for a chaotic RNG implemented 
in discretes, has a constant Lyapunov exponent, and is therefore (relatively) easily 
analyzed. In Appendix A, we calculate a lower bound on the output bit entropy, 
expressed in bits: 

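$$H_{bits} = (N-1)\log_2(A) - \tfrac{1}{2}\log_2\!\left(1 - A^{-2}\right) + \log_2(\sigma_g) + 1.77 \qquad (11)$$

Here N denotes the number of successive output bits collected and σ_g denotes the 
noise standard deviation. Note that for large N, 

$$H_{bits} \approx N \log_2(A) \qquad (12)$$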
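
For sizing purposes, the large-N behaviour can be turned into a quick estimate of how many raw output bits must be collected to reach a target entropy. This is a sketch only: it assumes the asymptotic form H ≈ N log2(A), ignores the constant terms of the lower bound, and uses hypothetical function names.

```python
import math

def asymptotic_entropy_bits(n_bits, gain):
    """Large-N entropy estimate H ~= N * log2(A), in bits."""
    return n_bits * math.log2(gain)

def raw_bits_needed(target_entropy_bits, gain):
    """Smallest N whose asymptotic entropy estimate reaches the target."""
    return math.ceil(target_entropy_bits / math.log2(gain))

print(raw_bits_needed(128, 1.8))   # raw bits needed for 128 bits of entropy at A = 1.8
```

A gain closer to 2 yields more entropy per output bit, but leaves less margin before interference can drive the OP-AMP into saturation.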

We have implemented this chaotic RNG and have verified that its output entropy 
approximates this lower bound to the precision we could measure. The output entropy 
is guaranteed in the sense that it will always be greater than the bound expressed in 
Equation 11, independent of any non-saturating interfering signal. This is a very 
important property for an RNG. In contrast, typical non-chaotic designs are plagued 
with very difficult isolation issues, tight tolerance on parameter matches, or clock 
phase-locking. Chaotic designs are often plagued with regions with negative 
Lyapunov exponents. Finally, both chaotic and non-chaotic designs can often be very 
difficult to analyze. 



4.4 Output Whitening 

The derivation in Appendix A suggests a particular form of post-processing to 
provide independence. Specifically, we proved that the two sides of Equation 13 are 
asymptotically equal: 

$$\sum_{k=0}^{N-1} A^{-k} c_k \;\approx\; A\,b_0 + \sum_{k=0}^{N-1} A^{-k} s_k + \sum_{k=0}^{N-1} A^{-k} g_k \qquad (13)$$

where c_k = sgn[b_k] and the s_k denote the interfering signal(s). The LHS comprises a 
quantizer that represents accurately a Gaussian random variable with some mean 
arising from the interference. The post-processor can then re-express the LHS as a 
binary number, which will comprise a standard binary-weighted A/D converter. 
Selecting a (large) subset of the bits of this binary number will yield a nearly 
independent bit-stream. Heuristically, the MSBs are not independent since they are 
heavily influenced by the signal’s distribution. Also, the LSBs after some point 
cannot convey any additional entropy, since only N log_2(A) bits of entropy are 
available. Thus at this point, these LSBs become deterministically related to the prior 
bits. This leaves us with a mid-range of bits that are independent. Of course other 
whitening methods such as post-processing via a hash function or DES are always 
valid. 
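
The post-processing step can be sketched as follows. The code forms the weighted sum of the quantizer outputs, re-expresses it as a fixed-width binary word, and keeps a middle slice of that word. The word width, the number of MSBs skipped, and the number of bits kept are hypothetical parameters chosen for illustration; in practice they would be chosen so that the kept slice lies below the signal-dominated MSBs and within the roughly N log2(A) bits of available entropy.

```python
def weighted_sum(bits, gain):
    """Compute sum_k A^-k * c_k for quantizer outputs c_k in {-1, +1}."""
    return sum(c * gain ** (-k) for k, c in enumerate(bits))

def mid_range_bits(bits, gain, word_bits=24, skip_msb=4, keep=12):
    """Binary-weight the sum into word_bits bits and return a middle slice, MSB first."""
    limit = gain / (gain - 1.0)                  # |sum| < A/(A-1) for any +/-1 sequence
    frac = (weighted_sum(bits, gain) + limit) / (2.0 * limit)   # map into [0, 1)
    word = min(int(frac * (1 << word_bits)), (1 << word_bits) - 1)
    shift = word_bits - skip_msb - keep          # LSBs discarded below the slice
    return [(word >> (shift + i)) & 1 for i in reversed(range(keep))]

example = [1, -1, -1, 1, 1, -1, 1, -1] * 5       # 40 illustrative quantizer outputs
print(mid_range_bits(example, 1.8))
```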



4.5 Concluding Remarks on Chaotic RNG 

We believe that the following are novel with regard to this type of chaotic RNG: use 
and implementation of RHP pole, calculation of entropy lower bound, realization that 
this lower bound is independent of external interference, form of whitening filter, and 
the derivation of a probability distribution. 
Appendix: Entropy Calculation 



Equation (10) is cast into an equivalent form by applying the filter $\sum_{k=0}^{N-1} A^{-k}$:

$$A^{-(N-1)}\,b_N = A\,b_0 + \sum_{k=0}^{N-1} A^{-k}\,(g_k + s_k - c_k) \qquad (14)$$



Due to the negative feedback via the {c_n}, |b_n| < 1 for all n. Thus the RHS above is 
bounded in amplitude by A^{-(N-1)}. In other words, with maximum error A^{-(N-1)}, 

$$\sum_{k=0}^{N-1} A^{-k} c_k \;\approx\; A\,b_0 + \sum_{k=0}^{N-1} A^{-k} s_k + \sum_{k=0}^{N-1} A^{-k} g_k \qquad (15)$$
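
The approximation and its error bound can be checked numerically. The sketch below assumes the recursion b[n+1] = A*b[n] + g[n] + s[n] - c[n] with c[n] = sgn(b[n]), accumulates both sides of the relation, and compares their difference with A^-(N-1) * b_N. The interference waveform, gain, noise level, and seed are hypothetical values chosen only for the example.

```python
import random

def check_quantizer_identity(n_steps=20, gain=1.8, noise_sigma=1e-3, b0=0.1, seed=2):
    """Return (rhs - lhs, A^-(N-1) * b_N, b_N) for one simulated run."""
    rng = random.Random(seed)
    b = b0
    lhs = 0.0              # sum_k A^-k * c_k
    rhs = gain * b0        # A*b_0 + sum_k A^-k * (s_k + g_k)
    for k in range(n_steps):
        c = 1.0 if b >= 0 else -1.0
        g = rng.gauss(0.0, noise_sigma)
        s = 0.05 * ((k % 7) - 3) / 3.0     # small deterministic interference (illustrative)
        lhs += c * gain ** (-k)
        rhs += (s + g) * gain ** (-k)
        b = gain * b + g + s - c
    predicted = gain ** (-(n_steps - 1)) * b
    return rhs - lhs, predicted, b

error, predicted, b_final = check_quantizer_identity()
print(error, predicted, b_final)
```

In this sketch the difference between the two sides matches A^-(N-1) * b_N to floating-point precision, so the quantization error shrinks geometrically with N as long as the state stays bounded.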



For fixed N, the RHS of Equation (15) is the sum of: 

1. The initial condition, A b_0 

2. A possibly large term due to the extraneous interference, called S: $S = \sum_{k=0}^{N-1} A^{-k} s_k$ 

3. A zero-mean Gaussian random variable: $\sum_{k=0}^{N-1} A^{-k} g_k$ 

We cannot rely on A b_0 and S to supply entropy, at least entropy that is unknown to 
an adversary who is attempting to break a cryptosystem. The initial condition b_0 may 
be largely deterministic if it is defined as the initial value of b_n just after the circuitry 
has been powered-up or supplied with a clock. Moreover, b_0 may be correlated to the 
previous exercise of the RNG, thereby reducing its entropy. Since we cannot specify 
what entropy S will have that is unknown to the adversary, we will assume 
conservatively that S is deterministic. Therefore, for the remainder of this argument, 
we conservatively model the RHS of Equation (15) as a Gaussian random variable 
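with mean (A b_0 + S) and standard deviation 

$$\sigma_0 = \sigma_g \sqrt{\sum_{k=0}^{N-1} A^{-2k}} \;\approx\; \frac{\sigma_g}{\sqrt{1 - A^{-2}}} \qquad (16)$$

where σ_g is the standard deviation of each g_k and the g_k are taken to be independent. 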



The LHS of Equation (15) is the (scaled) quantized value of the RHS in the sense 
that the {c_k} assume values of {-1,1} which can be mapped into binary ones and 
zeros. The quantizer defined by the set {c_k} has the property that it represents the 
RHS of this equation with a maximum error of A^{-(N-1)}. There is an infinite set of 
quantizers that have this same property. Generally, these quantizers would have 
different entropies. The minimum-entropy quantizer with this property is one with as 
few quantization steps as possible, namely one that uniformly spans [-1, +1] with step 
size 2A^{-(N-1)}. The entropy of this minimum-entropy quantizer is thus a lower bound on 
the entropy of the {c_k} quantizer. We calculate this lower bound here: 

Divide the [-1, +1] range into 2M levels, M positive and M negative. Then, to a 
high accuracy since M is very large, 

$$M = \tfrac{1}{2} A^{N-1} \qquad (17)$$
The entropy (in nats) of this quantizer is 

$$H_{nats} = -\sum_{i} p_i \ln p_i, \qquad p_i = \frac{1}{\sqrt{2\pi}\,\sigma_0 M} \exp\!\left(-\frac{(i-\mu)^2}{2\sigma_0^2 M^2}\right) \qquad (18)$$

μ denotes the mean value of the RHS of Equation (15), scaled by M: μ = (A b_0 + S)M. The tails of 
the Gaussian pdf have been ignored since σ_0 is small. Expanding the natural 
logarithm in Equation (18) gives 



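$$H_{nats} = \sum_{i} \frac{1}{\sqrt{2\pi}\,\sigma_0 M} \exp\!\left(-\frac{(i-\mu)^2}{2\sigma_0^2 M^2}\right) \left[\, \ln(M) + \ln\!\left(\sqrt{2\pi}\,\sigma_0\right) + \frac{(i-\mu)^2}{2\sigma_0^2 M^2} \,\right] \qquad (19)$$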



The first two terms in the brackets are independent of i and equivalently pre-multiply 
the summation operator. The weighting function, exp( ), is just a pdf which sums to 
one.^1 The last term in the brackets, when summed, very closely approximates the 

following integral where “ = _L ^ + y : 

M ' 



2 

0 



exp. 






u 



2<jt 



4^ < 



yijx-xf 

|LI 2 cri 



\dx 



( 20 ) 



The integral approximates unity due to the small σ_0. Therefore, the entropy is 

$$H_{nats} = \ln(M) + \ln\!\left(\sqrt{2\pi}\,\sigma_0\right) + 1 \qquad (21)$$

Substituting for σ_0 and M from Equations (16) and (17), and converting to bits yields 

$$H_{bits} = (N-1)\log_2(A) - \tfrac{1}{2}\log_2\!\left(1 - A^{-2}\right) + \log_2(\sigma_g) + 1.77 \qquad (22)$$



^1 Again ignoring the tails of the Gaussian since σ_0 is small 
Abstract. Cryptographic algorithms are more efficiently implemented 
in custom hardware than in software running on general-purpose proces- 
sors. However, systems which use hardware implementations have sig- 
nificant drawbacks: they are unable to respond to flaws discovered in 
the implemented algorithm or to changes in standards. In this paper we 
show how reconfigurable computing offers high performance yet flexible 
solutions for cryptographic algorithms. We focus on PipeRench, a recon- 
figurable fabric that supports implementations which can yield better 
than custom-hardware performance and yet maintains all the flexibility 
of software based systems. PipeRench is a pipelined reconfigurable fabric 
which virtualizes hardware, enabling large circuits to be run on limi- 
ted physical hardware. We present implementations for Crypton, IDEA, 
RC6, and Twofish on PipeRench and an extension of PipeRench, PipeRench+. 
We also describe how various proposed AES algorithms could 
be implemented on PipeRench. PipeRench achieves speedups of between 
2x and 12x over conventional processors. 



1 Introduction 

Most cryptographic algorithms function more efficiently when implemented in 
hardware than in software. This is largely because customized hardware can 
take advantage of bit-level and instruction-level parallelism that is not accessi- 
ble to general-purpose processors. Hardware implementations, lacking flexibility, 
can only offer a fixed number of algorithms to system designers. In this paper 
we describe a reconfigurable fabric which delivers high performance hardware 
implementations with the flexibility of general-purpose processors. 

The efficiency of an implementation is directly related to the degree to which 
it is customized to perform a given task. Hardware implementations are even 
more efficient when they are customized for a specific instance of an algorithm. 
For example, a hardware multiplier with one constant operand will generally 
take much less area than a general-purpose two operand multiplier. 

Of course implementing circuits with such a high degree of specificity in VLSI 
is generally infeasible because the cost of development and manufacturing must 
be offset by the chip’s applicability. Furthermore, to be responsive, a system must 
have some control over its embedded algorithms. For example, if a particular 
algorithm is discovered to be insecure, the system is rendered useless unless 
a different algorithm can be implemented. Reconfigurable hardware strikes a 
balance between customization and performance on the one hand and flexibility 
and cost on the other hand by permitting any algorithm to be highly customized. 

Reconfigurable hardware is a general term that applies to any device which 
can be configured, at run-time, to implement a function as a hardware circuit. 
Reconfigurable devices occupy a middle ground between traditional computing 
devices, e.g., microprocessors, and custom hardware. Microprocessors compute 
a function over time by multiplexing a limited amount of hardware using in- 
structions and registers. They are thus general-purpose and can compute many 
different functions. At the other end of the spectrum, custom hardware is used to 
implement a single function, fixed at chip fabrication time. A reconfigurable de- 
vice, of which the most common is a Field Programmable Gate Array (FPGA), 
has sufficient logic and routing resources that it can be configured, or program- 
med, to compute a large set of functions in space. Later, it can be re-programmed 
to perform a different set of functions. It shares attributes of microprocessors, 
in that it can be programmed post-fabrication, and of custom hardware, in that 
it can implement a circuit directly, avoiding the need to multiplex hardware. 

The primary ways in which reconfigurable devices are tailored to an appli- 
cation are by matching application parallelism with as many function units as 
needed, by sizing function units to the word size of the application, by creating 
customized instructions, by introducing pipelining, and by eliminating control 
overhead associated with the multiplexing of function units as in a microproces- 
sor. 

In the next section, we describe how reconfigurable computing devices can 
achieve the efficiency of highly customized designs while maintaining both cost- 
effectiveness and security. Section 3 focuses on how the components of typical 
cryptographic algorithms map to reconfigurable devices. Section 4 describes a 
pipelined reconfigurable device called PipeRench which overcomes many of the 
problems of using commercial FPGAs to implement datapaths. In particular 
PipeRench supports hardware virtualization which, like virtual memory, allows 
designs that do not fit on the physical device to run. Section 5 describes our 
implementations of several algorithms on PipeRench and our support of on-the-fly 
customization even in embedded systems. Related work is covered in 
Section 6. We conclude in Section 7. 

2 Reconfigurable Computing 

Functions for which a reconfigurable fabric can provide a significant benefit ex- 
hibit one or more of the following features: 

1. The function operates on bit-widths that are different from the processor’s 
basic word size. 

2. The data dependencies in the function allow multiple function units to ope- 
rate in parallel. 

3. The function is composed of a series of basic operations that can be combined 
into a single specialized operation. 

4. The function can be pipelined. 

5. Constant propagation can be performed, reducing the complexity of the 
operations. 

6. The input values are reused many times within the computation. 

These functions take two forms. Stream-based functions process a large data 
input stream and produce a large data output stream, while custom instructions 
take a few inputs and produce a few outputs. Notice that cryptographic algo- 
rithms possess many of the features described above. They can be implemented 
as stream-based functions which run completely on a reconfigurable device, or, 
when impractical to implement completely on a reconfigurable device, pieces 
of them can be implemented on the reconfigurable device as custom instructions. 
After presenting a simple example of a custom instruction to illustrate how a 
reconfigurable fabric can improve performance, we discuss the ways in which a 
fabric can be integrated into a complete system. 

2.1 Custom Instructions: The Permutation from Twofish 

In Twofish [27], in order to generate the key dependent S-boxes, multiple invo- 
cations of the q function are required. This function combines XOR, rotation, 
bit truncation, and table lookups. One way to accelerate the creation of the key 
dependent S-boxes is to implement a custom instruction, the q-instruction, on 
a reconfigurable fabric. This instruction takes an 8-bit operand and produces 
an 8-bit result. The custom instruction exploits the ability of the reconfigurable 
fabric to operate on small bit-width operands (4-bits), to execute many opera- 
tions in parallel, and to combine a sequence of operations into a single operator 
(through the use of lookup tables). 

2.2 A System Architecture 

Reconfigurable fabrics enhance performance mainly by providing the compu- 
tational datapath with more flexibility. Their utility and applicability is thus 
influenced by the manner in which they are integrated into the datapath. We 
recognize three basic ways in which a fabric may be integrated into a system: 
as an attached processor on the I/O or memory bus, as a co-processor, or as a 
functional unit on the main CPU. They are most widely useful when integrated 
into the processor as a reconfigurable function unit (RFU). The RFU has access 
to both the register file and the primary cache. The main reason for this is that 
they may be used to implement custom instructions which can operate on 
data in the processor registers. Furthermore, the bandwidth between the fabric 
and the processor (and the data in the processor’s cache) is highest when the 
fabric can directly access the cache. As we will show in the rest of the paper, this 
organization leads to a system which can significantly enhance the performance 
of all cryptographic algorithms. 

A fourth possible system organization is the system-on-a-chip approach used 
in embedded computing systems. In such an organization the fabric is closely 
coupled with a processor, but not so tightly coupled as to be on the processor’s 
datapath. 



3 Cipher Components 

Most ciphers can be specified as dataflow graphs consisting of a few different 
components. In this section we will enumerate the most common of these com- 
ponents and discuss how they map onto reconfigurable hardware. 

• Simple Arithmetic Operations 

Simple operations such as addition and subtraction appear frequently in cryp- 
tographic algorithms. These operations map easily to hardware, but due to 
their simplicity they offer no real gain for reconfigurable systems. 

• Narrow and Unusual Bit-widths 

Operations involving narrow bit-widths appear often in stream ciphers, and 
they are important in highly customized ciphers of any type. Standard micro- 
processors are notoriously bad at performing narrow bit-width operations, 
particularly if the values are not multiples of the natural word length of the ar- 
chitecture. Customized hardware supports operations on values of any width, 
avoiding the computation of unneeded values and the costly masking of un- 
desired bits. Implementing a highly customized design with a constant key 
allows all datapaths to be reduced to their minimum widths, eliminating the 
need for paths wide enough to support all possible key combinations. 

• Multiplication 

Multiplication is a difficult task to perform in hardware, in that simple hard- 
ware multipliers consume a large amount of hardware and compute their re- 
sults very slowly. Because there are many different ways to improve their per- 
formance, multipliers are a prime candidate for optimization and acceleration. 
Here we consider three different types of multiplier. 

• General-Purpose 

General-purpose multipliers (where both operands may take any value) are 
costly to implement in hardware. However, in many cryptographic algo- 
rithms, the result of nxn multiplies is often only n bits wide. On a reconfigu- 
rable device, the size and number of the adders can be reduced accordingly, 
eliminating the need to compute bits which are later ignored. 

• Multiplication by a Constant 

Implementing highly customized cryptographic hardware, for example when 
the key has been set to a constant value, can serve to change many (or 
all) of the general-purpose multipliers in a design into constant multipliers 
(multipliers where one operand is a constant). Constant multipliers can 
be made considerably smaller and faster in hardware than general-purpose 
multipliers. 

Suppose that one operand of a multiplier is set to a constant. The multiplier 
requires only as many partial products as there are 1s in the constant 
operand. On average, single-operand multipliers of this type are half the 
size and twice as fast as their general-purpose counterparts. 

• Multiplication Using a Redundant Coding Scheme 

A great deal of space can be saved when performing constant multiplication 
through the use of a redundant coding scheme. For example, it is straight- 
forward to transform a constant into canonical signed digit, or CSD, form. 
CSD vectors reduce the number of partial products needed for multiplica- 
tion by permitting bits in the constant operand to take on negative values. 
For example, the number 7 in binary is 0111, or 2^2 + 2^1 + 2^0. Multiplication 
by this constant requires three partial products, one for each 1 in the binary 
representation. The CSD representation of 7, however, is 100(-1), or 2^3 - 2^0. 
Multiplication by this constant vector requires only two partial products. 
As long as addition and subtraction take the same amount of time, no 
hardware overhead is incurred in implementing this type of multiplier. On 
average, a constant CSD multiplier will be about 75% smaller than a general- 
purpose multiplier because the number of partial products in constant CSD 
multipliers scales with the number of sequences of ones in the original con- 
stant. 

• Parallel Logical Operations 

Hardware allows many logical operations to be performed in parallel. This 
instruction level parallelism is one of the fundamental advantages of hardware 
over software in computation. Reconfigurable devices can be programmed to 
perform such complex logical operations in parallel, harnessing all the paralle- 
lism available to a hardware implementation. Furthermore, since the number 
and kind of function units needed at any point in the computation is confi- 
gured for the application, the parallelism is never artificially constrained by a 
lack of function units (as might happen in a VLIW architecture, for example). 

• Sequences of Logical Operations 

Most reconfigurable architectures, including standard commercial Xilinx 
FPGAs and the PipeRench architecture discussed later in this paper, implement 
function units using lookup tables. Thus a sequence of operators can often be 
combined into a single operator by setting the lookup-table appropriately. 

• Table Lookup 

Most block ciphers include a substitution box, or S-box. S-boxes are generally 
not easily expressible as linear transformations and are therefore implemented 
as table look-ups. Many reconfigurable architectures can implement tables of 
this kind, while others may need external scratch memory to store the S-box 
values. 

• Rotation and Shifting 

Lastly, two very common operations in cryptography are bitwise shifts and ro- 
tations. Microprocessors, particularly if programmed in C, are very inefficient 
at performing operations of this type. 







R.R. Taylor and S.C. Goldstein 







Fig. 1. Hardware virtualization in PipeRench overlaps computation with reconfigura- 
tion and provides the illusion of unlimited hardware resources. 

Hardware, on the other hand, can shift and rotate numbers easily. Variable 
shifts and rotates can be accomplished with barrel shifters, and constant shifts 
and rotates do not require any resources at all, as they can be achieved by 
simply reordering the actual wires. 

Reconfigurable hardware can accomplish all of the benefits associated with 
hardware while providing even more opportunities for optimization. In hig- 
hly customized designs such as fixed-key implementations, variable shifts and 
rotations may become fixed, reducing running time and freeing resources. 
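
The partial-product savings described above for constant and CSD multipliers can be illustrated with a short sketch. The recoding below uses the standard non-adjacent-form construction, which yields the canonical signed-digit representation of a positive integer; the function names are hypothetical and the code only counts partial products, it does not model hardware area or delay.

```python
def csd_digits(constant):
    """Canonical signed-digit (non-adjacent form) digits of a positive integer, LSB first."""
    digits = []
    n = constant
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)        # +1 if n % 4 == 1, -1 if n % 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits

def partial_products(digits):
    """Each nonzero digit costs one partial product (an add or a subtract)."""
    return sum(1 for d in digits if d != 0)

binary_ones = bin(7).count("1")            # partial products for plain binary 0111
csd = csd_digits(7)                        # digits of 100(-1), least significant first
print(binary_ones, csd, partial_products(csd))
```

For the constant 7 this reproduces the example in the text: three partial products in plain binary and two in CSD form.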



4 PipeRench 

PipeRench is a reconfigurable fabric being developed at CMU. It is an instance 
of the class of pipelined reconfigurable fabrics [26]. From the point of view of 
implementing cryptographic algorithms the three most important characteri- 
stics of PipeRench are: it supports hardware virtualization, it is optimized to 
create pipelined datapaths for word-based computations, and it has zero appa- 
rent configuration time. Hardware virtualization allows PipeRench to efficiently 
execute configurations larger than the size of the physical fabric, which relieves 
the compiler or designer from the onerous task of fitting the configuration into 
a fixed-size fabric. PipeRench achieves hardware virtualization by structuring 
the fabric (and configurations) into pipeline stages, or stripes. The stripes of an 
application are time multiplexed onto the physical stripes (see Figure 1). This 
requires that every physical stripe be identical. It also restricts the computations 
it can support to those in which the state in any pipeline stage is a function of 
the current state of that stage and the current state of the previous stage in 
the pipeline. In other words, the dataflow graph of the computation cannot have 
long cycles. 

Fig. 2. The interconnection network between 
two adjacent stripes. All switching 
is done at the word level. All thick arrows 
denote B-bit wide connections. 







Fig. 3. The structure of a processing element. 
There are N PEs in each stripe. Details about 
the zero-detect logic, the fast 
carry chain and other circuitry are left 
out. 



Each stripe in PipeRench is composed of N processing elements (PEs). In 
turn, each PE is composed of B identically configured 3-LUTs, P B-bit pass 
registers, and some control logic. The three inputs to the LUTs are divided into 
two data inputs (A and B) and a control input similar to [8]. Each stripe has an 
associated inter-stripe interconnect used to route values to the next stripe and 
also to route values to other PEs in the same stripe. An additional interconnect, 
the pass-register interconnect, allows the values of all the pass registers to be 
transferred to the pass registers of the PE in the same column of the next stripe. 

The structure of the interconnect is depicted in Figures 2 and 3. Both the 
inter-stripe interconnect and the pass-register interconnect switch B-bit wide 
buses, not individual bits. A limited set of bit permutations are supported in the 
interconnect by barrel shifters, which can left shift any input coming from the 
inter-stripe interconnect. Currently, the inter-stripe interconnect is implemented 
as a full crossbar. 

From the perspective of cryptographic algorithms, the current version of 
PipeRench has one significant drawback: it cannot perform large table lookups. 
Thus S-boxes with more than a few entries cannot be efficiently supported 
directly in the current version of PipeRench. One proposed extension to PipeRench, 
PipeRench+, allows the individual stripes to make memory accesses. This would 
allow PipeRench to efficiently support S-boxes and thus all the operations listed 
in Section 3. Without the memory extension, algorithms with S-box operations 
would be best supported by decomposing the algorithm into pieces where 
the non-S-box portions are implemented as custom instructions and the S-box 
lookups are performed in the processor core. 

The performance numbers we use in this paper are for an implementation in a
0.25 micron process. After an analysis described in [12] we determined that each
stripe will have 16 8-bit PEs, yielding a 128-bit wide stripe. Each PE contains
8 pass registers. The final chip will use 100 mm^2 for 28 stripes and an on-chip
cache capable of holding more than 512 virtual stripes.*

Along with the development of the PipeRench fabric, a fast compilation
framework was built [6]. Except where noted, all performance numbers are from
simulations of a 28-stripe N = 16, B = 8, P = 8 instance of PipeRench running
configurations created automatically by the compiler. For PipeRench+, we
consider PipeRench to be augmented by a small scratchpad memory of 1K bytes. We
consider two versions, PipeRench+16, which allows up to 16 simultaneous reads,
and PipeRench+4, which supports up to 4 simultaneous reads. They increase
the total area by 5% to 20%.

5 Applications 

In this section we describe how IDEA, Crypton, RC6, and Twofish can be
implemented on PipeRench, yielding high performance. We also describe how these
algorithms would be aided by PipeRench+. As an example of the flexibility of
reconfigurable systems we describe how key-specific instances of IDEA can be
easily created on the fly, without a compiler. We then evaluate reconfigurable
implementations for the proposed algorithms of AES.

5.1 IDEA 

The IDEA block cipher [28] is composed entirely of three fundamental operations
described in Section 3: addition modulo $2^{16}$, 16-bit XOR, and 16x16
multiplication modulo $2^{16}+1$. The 128-bit key is used to generate 52 16-bit
subkeys. Throughout the algorithm there are no backwards paths for data. In
addition, one operand of every multiplication operation in the algorithm is a
subkey, and for a highly-customized implementation it may be treated as a constant.
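
As an illustration of the three operations named above, the following is a
minimal Python sketch. The function names are ours, not the paper's; the only
IDEA-specific convention assumed is the standard one that the 16-bit word 0
stands for the value $2^{16}$ inside the modular multiplication.

```python
MASK16 = 0xFFFF          # 16-bit word mask
MOD = 0x10001            # 2^16 + 1, a prime

def add16(a, b):
    """Addition modulo 2^16."""
    return (a + b) & MASK16

def xor16(a, b):
    """Bitwise 16-bit XOR."""
    return a ^ b

def mul16(a, b):
    """Multiplication modulo 2^16 + 1, with the word 0 representing 2^16."""
    a = a or 0x10000     # map 0 -> 2^16
    b = b or 0x10000
    return ((a * b) % MOD) & MASK16   # a result of 2^16 is stored as 0

if __name__ == "__main__":
    print(add16(0xFFFF, 1), xor16(0x00FF, 0x0F0F), mul16(3, 5))
```

In a key-specific design, the second argument of `mul16` is a fixed subkey,
which is what allows the multiplier to be specialized.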

This means that the algorithm maps exceptionally well onto PipeRench. The 
forward-only datapath permits the entire application to be constructed as a 
single, long virtual pipeline. PipeRench is sufficiently wide to receive one com- 
plete 64-bit cleartext block and to return one 64-bit ciphertext block per cycle. 

Multipliers are the best candidates for optimization in this algorithm. If
implemented as general-purpose two-operand shift-and-add multipliers, they
require 16 partial products each. The modulo $2^{16}+1$ operation can be packed
into one stripe and computed using only three operations [18].
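
One well-known way to perform this reduction with three simple operations is
the "low-high" method, which uses the fact that $2^{16} \equiv -1 \pmod{2^{16}+1}$.
The sketch below is our illustration of that method and is not necessarily the
exact formulation of [18]; the function names are hypothetical.

```python
MOD = 0x10001  # 2^16 + 1

def mulmod_reference(a, b):
    """Straightforward IDEA multiplication (0 stands for 2^16)."""
    a = a or 0x10000
    b = b or 0x10000
    return ((a * b) % MOD) & 0xFFFF

def mulmod_lowhigh(a, b):
    """Same result, reducing the product with subtract / compare / add."""
    a = a or 0x10000
    b = b or 0x10000
    p = a * b                 # full product, at most 2^32
    lo = p & 0xFFFF           # low 16 bits
    hi = p >> 16              # remaining high bits
    # p = hi * 2^16 + lo, and 2^16 = -1 (mod 2^16 + 1), so p = lo - hi.
    r = lo - hi + (1 if lo < hi else 0)   # the three operations
    return r & 0xFFFF         # wrap into a 16-bit word

if __name__ == "__main__":
    print(mulmod_lowhigh(0xFFFF, 0xFFFF), mulmod_reference(0xFFFF, 0xFFFF))
```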

* First silicon for a prototype of PipeRench implemented in 0.35 micron
technology is expected in October 1999. It will have 16 stripes.

Table 1. Comparison of IDEA implementations.

Processor                    Clock Speed        Clocks per Block   Throughput (MBytes/sec)
PipeRench (template)         100 MHz            6.3                126.6
PipeRench (compiler)         100 MHz            12                 66.3
Pentium-II using MMX [21]    450 MHz            358                10.0
Pentium [23]                 (scaled) 450 MHz   590                6.1
IDEACrypt Kernel [22]        100 MHz            3                  90.0

A simple key-specific implementation can be created if the compiler is given
the subkeys as constants. The compiler performs constant propagation, reducing
the number of partial products to an average of 8 per multiplier. Further
optimization can be performed by transforming the shift-and-add multiplier into a
constant CSD multiplier.
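
To make the canonical signed digit (CSD) idea concrete, the sketch below recodes
a constant into digits from {-1, 0, +1} with no two adjacent nonzero digits, and
then multiplies by shift-and-add/subtract. It is a generic textbook recoding, not
the PipeRench compiler's code; the function names are ours.

```python
def csd_digits(c):
    """Return the CSD digits of a non-negative integer, least significant first."""
    digits = []
    while c:
        if c & 1:
            d = 2 - (c & 3)   # +1 if c mod 4 == 1, -1 if c mod 4 == 3
            c -= d            # clearing this digit leaves c divisible by 4
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits

def multiply_by_constant(c, x):
    """Multiply x by the constant c using one add/subtract per nonzero digit."""
    return sum(d * (x << i) for i, d in enumerate(csd_digits(c)) if d)

if __name__ == "__main__":
    print(csd_digits(7))                 # 7 = 8 - 1
    print(multiply_by_constant(7, 5))
```

Each nonzero digit corresponds to one partial product, so fewer nonzero digits
mean a smaller constant multiplier.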

In Table 1 we compare both the template- and compiler-generated IDEA to
optimized software implementations running on state-of-the-art processors, and
to custom VLSI designs. PipeRench outperforms the processors listed by over
10x.

Somewhat surprisingly, PipeRench outperforms the 0.25 micron IDEACrypt
Kernel from Ascom [22]. This is due to several factors. First, the PipeRench
implementation of IDEA does not include the time taken to generate keys. This is
because PipeRench targets streaming media applications, in which key
generation comprises only a small preprocessing step. Second, because of the
pipelined nature of PipeRench, IDEA has effectively been pipelined into 177
stages. If a custom silicon implementation were built with such a high degree of
pipelining, the circuit would allow a fast clock (at the cost of silicon area).
Lastly, there is a 177-cycle latency through the pipeline. Nonetheless, it is
noteworthy that the raw throughput of PipeRench is 40% higher than that of
full-custom silicon.

5.2 IDEA in Embedded Systems 

One of the challenges of placing a PipeRench fabric running IDEA in an embed- 
ded system is to reduce the time to generate a single-key configuration. While 
the compiler for PipeRench can compile a complete, single-key optimized IDEA 
application in less than one minute, this is too long for an embedded system. 
One method for accomplishing this is the use of precompiled CSD multiplier 
templates. 

An 8-round IDEA pipeline (with the output transformation) contains 34 
16x16 bit constant multipliers. The vast majority of the compilation time in 
generating a single-key IDEA pipeline is spent propagating constants through 
these multipliers, reducing them to the minimum required number of partial 
products. This operation is very important since nearly all the efficiency gained 
by fixing the key is a result of this reduction. 

The task of compilation can be separated into two components: optimization
and generation of the multipliers, and generation of the rest of the pipeline. If an
interface is agreed upon by the multipliers and the rest of the pipeline, the two
tasks can be performed independently. We have developed a system for creation
of IDEA with template-based multiplication.

The non-multiplier portions of the pipeline were hand-compiled beforehand 
to give maximum performance. They interact with the template-generated mul- 
tipliers according to a pre-defined interface, and they do not place any important 
data in the registers which are used by the multiplier. The multipliers are thus 
treated as black boxes, and the non-multiplier operations are wrapped tightly 
around the multipliers to perform all the necessary computations in the mini- 
mum possible number of stripes. 

To generate the multipliers themselves, a system is required that rapidly re- 
turns configuration bits for the stripes that perform the multiplication. Rather 
than expending a tremendous amount of effort and silicon area on hardware 
which actually computes these bitstreams, it is preferable to simply construct a 
lookup table which converts constant multiplicands into the necessary configu- 
ration bits. 

Such an implementation would consist of nothing more than a ROM
preloaded with the appropriate values. To reduce the size of the ROM, each 16-bit
constant is broken into two 8-bit constants. The 8-bit constant is used as an
index into a table of stripe configurations which implement that portion of the
multiplier. The ROM would need approximately 256 120-bit entries. Although
there is some overhead when recombining the two portions of the constant,
using CSD representations we can still build the entire multiplier in two or three
stripes. (Three stripes are required only for certain "bad" CSD vectors, which
occur in only 1/16 of the entries. The multiplier interface is maintained whether
the multiplier needs two or three stripes.)
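
The table-lookup scheme can be modelled in software as follows. This is only a
sketch under stated assumptions: the real ROM holds 120-bit stripe
configurations, whereas the hypothetical `TEMPLATE_ROM` here holds lists of
(shift, sign) terms standing in for those configuration bits, and all names are
ours.

```python
def csd_terms(c):
    """CSD recoding of c as (shift, sign) pairs; sum(sign << shift) == c."""
    terms, shift = [], 0
    while c:
        if c & 1:
            sign = 2 - (c & 3)        # +1 or -1
            terms.append((shift, sign))
            c -= sign
        c >>= 1
        shift += 1
    return terms

# One precomputed entry per 8-bit constant: 256 entries, built once.
TEMPLATE_ROM = [csd_terms(b) for b in range(256)]

def apply_template(terms, x):
    """Evaluate one half-multiplier: a sum of shifted copies of x."""
    return sum(sign * (x << shift) for shift, sign in terms)

def multiply_by_subkey(k, x):
    """Multiply x by a 16-bit constant k using two 8-bit table lookups."""
    hi, lo = (k >> 8) & 0xFF, k & 0xFF
    # Recombination step: the high half is weighted by 2^8.
    return (apply_template(TEMPLATE_ROM[hi], x) << 8) + apply_template(TEMPLATE_ROM[lo], x)

if __name__ == "__main__":
    print(multiply_by_subkey(0x1234, 0xBEEF) == 0x1234 * 0xBEEF)
```

Generating a key-specific multiplier then costs two lookups rather than a
compiler run, which is the point of the template approach.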

Because the design of the template-generated multipliers and the logic placed 
between them only needs to be done once, great care can be taken to make 
the design very efficient. A single round of IDEA generated by this system is 
generally only 20 or 21 stripes long, resulting in a complete IDEA pipeline of 
only 177 stripes. This is a tremendous improvement over the 338-stripe compiler- 
generated pipeline. This improvement is primarily due to the compiler's not using 
registered feedback within a stripe. 
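
The stripe counts above line up with the figures in Table 1 under a simple
model, which is our inference rather than something the text states: when a
virtual pipeline of V stripes runs on P physical stripes, one block completes
about every V/P cycles. The sketch assumes the 28-stripe, 100 MHz instance
described earlier and 8-byte IDEA blocks.

```python
def clocks_per_block(virtual_stripes, physical_stripes=28):
    """Assumed model: V/P cycles per block once V exceeds P, else 1."""
    return max(1.0, virtual_stripes / physical_stripes)

def throughput_mbytes(virtual_stripes, block_bytes=8, clock_mhz=100,
                      physical_stripes=28):
    """Throughput in MBytes/sec under the model above."""
    return clock_mhz * block_bytes / clocks_per_block(virtual_stripes,
                                                      physical_stripes)

if __name__ == "__main__":
    for name, stripes in (("template", 177), ("compiler", 338)):
        print(name, round(clocks_per_block(stripes), 1),
              round(throughput_mbytes(stripes), 1))
```

Under these assumptions the template and compiler pipelines come out at roughly
6.3 and 12 clocks per block, in line with the table.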

5.3 Crypton 

The Crypton [20] cipher can be implemented as a complete stream-function on 
PipeRench, with reasonable speedup. There are, however, operations within the 
Crypton cipher which are difficult to accomplish on the PipeRench architecture. 

Most parts of the cipher map easily onto PipeRench. The byte transposition
$\tau$ can be implemented entirely in the interconnect of PipeRench and does not
require any computational resources at all. The bit permutations, $\pi_o$ or
$\pi_e$, can each be completed in four stripes. The key addition, $\sigma_K$,
takes only one stripe.

Table 2. Comparison of different PipeRench systems and the speedups they achieve
over the best versions in the cited papers. The "method" column indicates how the
code was created: 'C' indicates it was automatically generated by the compiler. 'A'
indicates a hand-coded version in CVHASM [15].

Cipher         System         Method   Clocks/block   Throughput (MBytes/sec)   Speedup
RC6 [25]       PipeRench      A        28             58.8                      4.7x
Crypton [20]   PipeRench      A        65             24.8                      1.3x
               PipeRench+4    A        50             32.5                      1.8x
               PipeRench+16   A        19             86.8                      4.7x
Twofish [27]   PipeRench+4    C        51             15.6                      2.8x
               PipeRench+16   C        15             54.3                      9.7x
               PipeRench+4    A        36             36.0                      3.9x
               PipeRench+16   A        9.7            164.7                     14.6x

The nonlinear S-box substitution, $\gamma$, however, is not easy to implement on
PipeRench. Each of the three small 4x4 $F_i$ S-boxes can be implemented either
as logic or as a look-up table. In either case, because the PEs on PipeRench
operate only on 8-bit quantities, 7/8 of each PE is wasted. Each PE generates
only a single bit, which is later combined by other PEs into a 4-bit quantity.
When implemented by hand, this process requires three stripes per $F_i$.
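
The one-bit-per-PE decomposition can be pictured with the following sketch,
which splits a 4x4 S-box into four single-output Boolean functions and then
recombines their outputs. The table used is an arbitrary 4-bit permutation
chosen for illustration; it is not one of Crypton's tables, and the function
names are ours.

```python
# Arbitrary 4-bit permutation used only as an example (not a Crypton table).
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

def truth_tables(sbox):
    """One 16-entry truth table (packed into an int) per output bit."""
    tables = []
    for bit in range(4):
        mask = 0
        for x, y in enumerate(sbox):
            mask |= ((y >> bit) & 1) << x
        tables.append(mask)
    return tables

def eval_bitwise(tables, x):
    """Compute each output bit separately, then merge into a 4-bit value."""
    out = 0
    for bit, mask in enumerate(tables):
        out |= ((mask >> x) & 1) << bit   # each step models one single-bit PE
    return out

if __name__ == "__main__":
    tables = truth_tables(SBOX)
    print(all(eval_bitwise(tables, x) == SBOX[x] for x in range(16)))
```

Each of the four truth tables produces one useful bit, which is why most of an
8-bit PE goes unused in this mapping.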

A single round of Crypton uses 16 S-boxes, with three $F_i$ units in each
S-box. As a result, a single round of Crypton requires about 150 stripes, causing
the entire 12-round cipher to occupy 1800 virtual stripes. PipeRench may have
difficulty handling an application of that size due to limitations on the storage
space for virtual stripes. The application can be re-pipelined on the inside in
order to re-use the S-box in each round on all four 32-bit words. This cuts the
length of the virtual pipeline by a factor of four, but it incurs a considerable
amount of overhead in re-pipelining, and so reduces the overall throughput of
the application.* Even with this large number of stripes, the application still
gives 24.8 MByte/sec of throughput (with tremendous latency), compared with
18.46 MByte/sec on a 450 MHz Pentium Pro (coded with in-line assembly).

When implemented on PipeRench+, Crypton is significantly smaller and
faster. Each of the S-boxes requires only a single stripe: the stripe that contains
the load. Thus, the entire round takes only 24 stripes, yielding a total of 288
stripes. Due to the limit on memory accesses per cycle this implementation
yields 87 MByte/sec on PipeRench+16.
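
The stripe counts quoted for Crypton can be checked with a few lines of
arithmetic. The per-round figures (about 150 stripes without memory, 24 with
PipeRench+) are taken from the text; the split into S-box stripes versus the
remainder is our own back-of-the-envelope reading.

```python
ROUNDS = 12
SBOXES_PER_ROUND = 16
F_UNITS_PER_SBOX = 3
STRIPES_PER_F = 3

# Stripes spent on S-boxes alone in one round without memory support.
sbox_stripes = SBOXES_PER_ROUND * F_UNITS_PER_SBOX * STRIPES_PER_F

stripes_per_round = 150                    # "about 150", from the text
total_virtual = ROUNDS * stripes_per_round
repipelined = total_virtual // 4           # S-box reused on four 32-bit words

plus_per_round = 24                        # with PipeRench+, from the text
plus_total = ROUNDS * plus_per_round

if __name__ == "__main__":
    print(sbox_stripes, total_virtual, repipelined, plus_total)
```

The S-boxes account for nearly all of the roughly 150 stripes per round, which
is why memory support shrinks the pipeline so sharply.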

5.4 RC6 

RC6 [25] is easy to implement as a stream function, but is only 5x faster than
a 200 MHz Pentium Pro due to the general-purpose 32-bit multiplies (2 in each
round). Unlike IDEA, neither operand of the multiplier is a constant. However,
because the multiplier result is only 32 bits, the size of the multiplier is reduced
by half. The variable rotates also require a significant amount of hardware: six
stripes to do both rotates in parallel.
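
For reference, one RC6 round with 32-bit words looks as follows in software. The
sketch follows the published RC6 round structure; the function and parameter
names (`rc6_round`, `s0`, `s1`) are ours. It shows the two multiplications whose
results are truncated to 32 bits and the two data-dependent rotates.

```python
W = 32
MASK = (1 << W) - 1

def rotl(x, r):
    """Rotate a 32-bit word left by r (taken modulo 32)."""
    r %= W
    return ((x << r) | (x >> (W - r))) & MASK

def rc6_round(a, b, c, d, s0, s1):
    """One RC6 encryption round on words (a, b, c, d) with round keys s0, s1."""
    # Only the low 32 bits of each product are kept, so the upper half of a
    # full 32x32 multiplier is never needed.
    t = rotl((b * (2 * b + 1)) & MASK, 5)
    u = rotl((d * (2 * d + 1)) & MASK, 5)
    # Variable rotates: the amount is given by the low 5 bits of u and t.
    a = (rotl(a ^ t, u & 31) + s0) & MASK
    c = (rotl(c ^ u, t & 31) + s1) & MASK
    return b, c, d, a

if __name__ == "__main__":
    print(rc6_round(0, 1, 0, 0, 0, 0))
```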

* Another solution is to chain multiple PipeRench chips together, making a bigger
pipeline. This solution also doubles performance.

5.5 Twofish

The Twofish [27] algorithm, like many others, makes substantial use of S-boxes,
which makes it ill-suited to PipeRench. However, as described in Section 2.1,
pieces of the algorithm can be mapped to custom instructions. For example, the
q-function can be mapped to ten stripes on PipeRench. The space-consuming part
of the function is again the four table lookups. In spite of this, using PipeRench
to compute the S-boxes reduces the time for key setup, making "full keying" a
viable option even when very few blocks are encrypted with a single key.

However, on PipeRench+, the S-box lookups are easily handled. Each round
requires 16 loads (from the $q_0$ and $q_1$ functions) and some rotating, XORing,
and addition, all of which are easy to accomplish on PipeRench+. Both the compiler
and hand-coded versions achieve a speedup of about 3x on PipeRench+4 over the
fastest assembly version running on a 200 MHz Pentium Pro/II. Interestingly, this
is in spite of the fact that the compiler version is twice as large. This is because
the compiler version spaces out the loads so there are very few stalls. The extra
memory bandwidth of PipeRench+16 allows the hand-coded version to get more
than 14x speedup.



5.6 Other AES Algorithms 

We now discuss the AES algorithms which cannot be fully implemented on
PipeRench. However, certain operations in the algorithms can be implemented
as custom instructions, resulting in faster overall performance. In addition, most
could be implemented on PipeRench+.

Cast-256 [1], like many of the proposed AES algorithms, is heavily based on
S-boxes, which are not amenable to the current version of PipeRench. However,
for each of the three keyed-round operations there are key-specific left rotations
which can be optimized to constant rotations. Thus custom instructions can be
created based on the subkeys which reduce the time to compute the S-box input
to one cycle.
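
The idea of specializing a key-dependent rotation can be sketched as below. We
assume the type-1 CAST round function, which forms its S-box input as
((Km + D) <<< Kr) with masking key Km and rotation key Kr; the factory function
`make_sbox_input` is a hypothetical software stand-in for generating a
key-specific custom instruction.

```python
MASK32 = 0xFFFFFFFF

def make_sbox_input(km, kr):
    """Return a function specialized to fixed subkeys (km, kr).

    Once kr is fixed, the rotate amount is a constant, which in hardware is
    pure wiring rather than a barrel shifter.
    """
    kr &= 31
    def sbox_input(d):
        v = (km + d) & MASK32                       # keyed addition
        return ((v << kr) | (v >> (32 - kr))) & MASK32   # constant rotate
    return sbox_input

if __name__ == "__main__":
    f = make_sbox_input(km=0, kr=8)
    print(hex(f(0x12345678)))
```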

Performance of DFC [33] would improve by implementing the necessary mul- 
tiprecision arithmetic as custom instructions. 

The Hasty Pudding Cipher [29] gains significant performance as a series of
custom instructions. The basic operations are all easily performed on PipeRench,
but the entire cipher cannot be implemented as a stream function due to the
large tables needed to hold the key expansion.

Both the key schedule and encryption/decryption in LOKI97 [5] can be sped 
up with a custom instruction which implements all but the S-box of the g- 
function and the data routing. 

Deal [17], E2 [31], FROG [11], MAGENTA [3] and MARS [7] appear to be 
unsuitable for implementation on PipeRench due to the use of large table lookups 
in both key formation and encryption/decryption. 

When encrypting or decrypting using Rijndael [10], each round consists of 4
basic functions, of which three can be implemented as custom instructions.

SAFER+ [9], like HPC, DFC, Cast-256, Rijndael, and LOKI97, can benefit
from using custom instructions, although the entire cipher cannot be
implemented on PipeRench due to the large M matrix.

Serpent [2] uses S-boxes for the entire process except for the initial and final 
permutations which could be custom instructions. 



6 Related Work 

There is a growing body of work on using reconfigurable devices to implement 
cryptographic algorithms. Reconfigurable implementations of DES [32,14] and 
RSA [30] have all achieved significant speedups over general-purpose processors. 
However, in none of these cases were key-specific hardware implementations 
generated. The impact on the hardware size and throughput of key-specific im- 
plementations of DES using Xilinx FPGAs is discussed in [19]. 

In [16] FPGA-based implementations of DES are described that make use
of many of the helpful attributes of reconfigurable devices which are used on
PipeRench, including loop unrolling (which PipeRench requires) and pipelining
(which PipeRench does implicitly). The acceleration of modular multiplication
and exponentiation (as used in RSA) using arithmetic architectures which have
been optimized for use on FPGAs is described in [4].

More generally, PipeRench is one of several approaches towards making
reconfigurable hardware more applicable to computation of the sort needed by
cryptographic applications. PRISC [24] is among the earliest work on
integrating a reconfigurable function unit with a processor. GARP [14],
Chimaera [13], and OneChip [34] are more recent examples of such work. The main
difference between these systems and PipeRench is that PipeRench supports virtual
hardware, freeing the application designer from fixed hardware constraints.

7 Conclusions 

The use of reconfigurable hardware in cryptographic systems has many advan- 
tages. Reconfigurable implementations benefit from the hardware-based perfor- 
mance of custom VLSI while maintaining the flexibility and adaptability of soft- 
ware. Unlike fixed hardware, reconfigurable devices deliver highly customized, 
efficient solutions that are adaptable and robust to changing system needs. 

The PipeRench reconfigurable architecture is well suited to many
cryptographic tasks. Because it supports hardware virtualization, it can implement
designs which are larger than the amount of physical hardware available. This
is important, as many of the algorithms which can be mapped entirely to
reconfigurable devices require tremendous amounts of physical hardware. Some
algorithms which cannot be mapped completely onto PipeRench can still be
accelerated by building custom instructions. PipeRench has two drawbacks, both
relating primarily to table lookups. For small tables, such as the F tables in
Crypton, the 8-bit PEs are very inefficient. For larger tables, such as S-boxes
found in many of the algorithms, PipeRench does not have any facility for
performing memory accesses. We explore an extension to PipeRench, PipeRench+,
which overcomes this second drawback.

Finally, key-specific circuits for PipeRench can be generated in embedded 
systems: with the use of a simple table lookup operation, the full performance of 
PipeRench can be obtained without waiting for software-based compilation. In 
the case of IDEA, we are able to exceed the performance of custom hardware. 
Abstract. The CryptoBooster is a modular and reconfigurable cryptographic
coprocessor that takes full advantage of current high-performance
reconfigurable circuits (FPGAs) and their partial reconfigurability. The
CryptoBooster works as a coprocessor with a host system in order to
accelerate cryptographic operations. A series of cryptographic modules
for different encryption algorithms are planned. The first module we
implemented is IDEACore, an encryption core for the International Data
Encryption Algorithm (IDEA™).

Keywords: Cryptography, Coprocessor, Reconfiguration, FPGA, IDEA. 



1 Introduction 

In this paper we describe a novel cryptographic coprocessor, the CryptoBooster,
optimized for reconfigurable computing devices (e.g., Field Programmable Gate
Arrays, or FPGAs). Our implementation is modular, scalable, and helps to resolve
the trade-off between device size and data throughput. The implementation is
designed to support a large number of different encryption algorithms and includes
appropriate session management.

The first CryptoBooster implementation we propose implements a
symmetric-key block cipher algorithm [7]. IDEACore is the first of a series of
cryptographic modules for the CryptoBooster coprocessor. A simple reconfiguration
of the reconfigurable computing device will suffice to replace IDEACore by another
block cipher module (e.g., DES). The other modules of the CryptoBooster generally
remain unchanged.

^ IDEA is patented in Europe and the United States [11,12]. The patent is held by 
Ascom Systec Ltd., http://www.ascom.ch/systec. 



C.K. Koc and C. Paar (Eds.): CHES’99, LNCS 1717, pp. 246-256, 1999.
© Springer-Verlag Berlin Heidelberg 1999

In Section 2 we describe the CryptoBooster architecture. Section 3 gives a short
introduction to the IDEA™ encryption algorithm and reviews existing hardware
implementations of this algorithm. Section 4 describes the implementation of
IDEACore, the first cryptographic module for the CryptoBooster. We conclude the
paper in Section 5 with the synthesis and performance results of our
implementation.



2 CryptoBooster 

CryptoBooster is a modular coprocessor dedicated to cryptography. It is designed to 
be implemented in Field Programmable Gate Arrays (FPGAs) and to take advantage 
of their partial reconfiguration features. The CryptoBooster works as a coprocessor 
with a host system in order to accelerate cryptographic operations. It is connected to a 
session memory responsible for storing session information (Figure 1). A session is cha- 
racterized by a set of parameters describing the cryptographic packets, the algorithm 
used, the key(s), the initial vector(s) for block chaining, and other information. 




Fig. 1. The CryptoBooster works as a coprocessor together with a host system. Typi- 
cally, the host system is a PC. The CryptoBooster needs additional memory to store 
session information. 



Our design is motivated by the following objectives: (1) The main goal is to have
maximum data throughput, to provide a design able to cope with ever-increasing
network speed. This justifies a hardware implementation in place of a software
solution. Physical security may, however, also be an argument for a hardware
implementation. (2) Our requirements include the ability to easily configure
different algorithm sub-blocks and the associated subkey generation. We clearly
need a highly modular architecture, allowing us to easily change building blocks.
The modularity has been pushed far enough to allow partial reconfiguration of the
coprocessor. Partial reconfiguration makes it possible to offer several algorithms
on one and the same physical chip with limited resources.



2.1 Modular Architecture 

A block diagram of the CryptoBooster architecture is shown in Figure 2. The
InterfaceAdapter module is a technology-dependent interface to the host system.
Typically, this is a PCI or a VME interface, but one may also imagine a networking
interface like Ethernet. The Hostinterface is the software interface to the host
system. It offers read/write registers and interrupts to configure and control the
coprocessor. The SessionMem module provides an interface to different types and
configurations of physical memories. A separate session memory has been chosen
mainly to limit the communication between host and coprocessor and in order to
switch rapidly between different sessions.

The CryptoCore module itself is subdivided into three parts: 

— CypherCore: encryption algorithm, 

— SessionAdapter: session parameter management (specific to each CypherCore),

— SessionControl: central controller for session management. 

The CypherCore and SessionAdapter modules are intelligent modules and can be
queried by the SessionControl module. They respond with the implemented features
available. It is therefore possible to exchange these modules without changing the
control mechanism. All these modules communicate with each other and with the
modules outside the CryptoCore using unidirectional point-to-point links called
CoreLink. These links are designed to transmit control or data packets. This
homogeneity at the interconnection level strongly enforces the modularity of the
system.







Fig. 2. Block diagram of the CryptoBooster architecture. The architecture is highly 
modular and uses standardized interconnections. 



2.2 Advantages of an FPGA-Based Implementation and 
Reconfiguration Features 

An FPGA circuit is an array of logic cells placed in an infrastructure of interconnections 
[14]. Each cell is a universal function or a functionally complete logic device, which can 
be programmed to realize a certain function. Interconnections between the cells are 
also programmable. The versatility allowed by logic blocks and the flexibility of the 
interconnections provide high freedom of design during the utilization of FPGAs. 

The CryptoBooster is implemented using the VHDL hardware description
language. The design can thus be synthesized without major problems for FPGAs as
well as for VLSI technology. A VLSI solution generally results in higher
performance than an FPGA implementation, but the latter has several important
advantages: (1) Reconfigurability of the FPGA allows the developer to easily
provide specific solutions to the customer, as is often needed in cryptography;
(2) A VLSI multi-algorithm coprocessor requires all corresponding CypherCores to
be implemented in the chip, which demands a huge number of transistors; (3) On the
other hand, an FPGA only contains one encryption algorithm at a given time. Other
algorithms are available in the form of a configuration bitstream. Thus the maximum
area required corresponds to the area used by the largest algorithm.

One can distinguish between full and partial FPGA reconfiguration. Full
reconfiguration is the common method currently used. The configuration is replaced
by a new one each time the algorithm is changed (Figure 3). Partial reconfiguration
makes it possible to reconfigure parts of the FPGA, i.e., only the part where the
algorithm is implemented on the FPGA has to be reconfigured. This normally results
in a much shorter interruption of service compared to full reconfiguration. The
CryptoBooster is designed to take full advantage of partial reconfiguration.




Fig. 3. Full reconfiguration requires reprogramming the entire chip, including
common parts.



Fig. 4. Partial reconfiguration requires reprogramming only the parts specific to a
particular algorithm. (Blocks shown: CypherCore, SessionMem, SessionAdapter,
SessionControl, Hostinterface, InterfaceAdapter.)



3 The IDEA™ Block Encryption Algorithm

The International Data Encryption Algorithm (IDEA™) was developed by Xuejia Lai
and James Massey of the Swiss Federal Institute of Technology in Zurich. The
original version, called PES (Proposed Encryption Standard), was first published in
1990 [8]. The following year, after Biham and Shamir demonstrated differential
cryptanalysis, the authors strengthened their cipher against the attack and called
the new algorithm IPES (Improved Proposed Encryption Standard) [9]. IPES changed
its name to IDEA in 1992 [7].

IDEA™ is one of a number of conventional encryption algorithms that have been
proposed in recent years to replace DES. However, there has been no rush to adopt
it as a replacement for DES, partly because it is patented and must be licensed for
commercial applications, and partly because people are still waiting to see how well
the algorithm fares during the upcoming years of cryptanalysis.

IDEA™ is a 64-bit block cipher that uses a 128-bit key to encrypt data (DES also
uses 64-bit blocks but only a 56-bit key). The same algorithm is used for encryption
and decryption. It consists of 8 computationally identical rounds followed by an
output transformation. Round $r$ uses six 16-bit subkeys $K_i^{(r)}$,
$1 \le i \le 6$, to transform a 64-bit input $X = (X_1, X_2, X_3, X_4)$ into an
output of four 16-bit blocks, which are input to the next round. The round 8 output
enters the output transformation, employing four additional subkeys $K_i^{(9)}$,
$1 \le i \le 4$, to produce the final ciphertext $Y = (Y_1, Y_2, Y_3, Y_4)$. All
52 16-bit subkeys are derived from the 128-bit master key K. The key is long enough
to withstand exhaustive key searches well into the future.
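
The count of 52 subkeys (8 rounds of six, plus four for the output
transformation) can be illustrated with the standard IDEA encryption key
schedule, which the text above does not spell out: the 128-bit key is read as
eight 16-bit words, then rotated left by 25 bits to supply the next eight, and
so on. The function name `idea_subkeys` is ours.

```python
KEY_BITS = 128
KEY_MASK = (1 << KEY_BITS) - 1

def idea_subkeys(key):
    """Expand a 128-bit integer key into the 52 16-bit encryption subkeys."""
    subkeys = []
    k = key & KEY_MASK
    while len(subkeys) < 52:
        # Read the current key register as eight 16-bit words, MSB first.
        for i in range(8):
            subkeys.append((k >> (112 - 16 * i)) & 0xFFFF)
        # Rotate the 128-bit register left by 25 bits for the next batch.
        k = ((k << 25) | (k >> (KEY_BITS - 25))) & KEY_MASK
    return subkeys[:52]

if __name__ == "__main__":
    ks = idea_subkeys(0x00010002000300040005000600070008)
    rounds = [ks[6 * r: 6 * r + 6] for r in range(8)]   # six subkeys per round
    output = ks[48:52]                                   # output transformation
    print(len(ks), [hex(v) for v in rounds[0]], [hex(v) for v in output])
```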

IDEA™ is easy to implement in both software and hardware, even in embedded
systems. A typical software implementation of IDEA™ in C (using the Ascom
Systec source code) performs data encryption at 16 Mbit/s on a 180 MHz Pentium Pro
machine. It is clear that a hardware implementation of the algorithm can
substantially increase the throughput. To the best of our knowledge, the first
VLSI implementation was developed at the Swiss Federal Institute of Technology in
Zurich by Bonnenberg et al. [1] and reached a throughput of 44 Mbit/s at 25 MHz.

Current hardware implementations have stressed the importance of the
combinatorial delay and area consumption of the multiplication modulo
$(2^{16}+1)$ units, which are crucial to the entire system. These units are the
limiting factor in obtaining high data throughput. Various methods of implementing
such a multiplication are investigated in [16, 5, 15, 10, 3]. The VINCI
implementation uses a modified Booth recoding multiplication and fast carry select
additions for the final modulo correction [17]. In general, for larger words,
ROM-based solutions using lookup tables require large ROMs. In a recent paper,
Zimmermann [16] presents an efficient VLSI implementation of the modulo
$(2^{16}+1)$ multiplication. Several different hardware implementations of the
IDEA™ algorithm are given below.



3.1 The VINCI VLSI Implementation

A new VLSI implementation, called VINCI [4, 17], was developed in 1993 at ETH
Zurich. The chip consists of 250,000 transistors on a total area of 107.8 mm^2 and
attained 177.8 Mbit/s @ 25 MHz. The data path was optimized using an eight-stage
pipeline and full-custom modulo $(2^{16}+1)$ multipliers. The VINCI chip was the
first chip that could be used for on-line encryption in high-speed networking like
ATM or FDDI. Figure 5 shows the pipeline of the VINCI data path for one round. The
multiplication modulo $(2^{16}+1)$ operation is distributed over two pipeline
stages to reduce the critical path.

Note that the output round is identical to the first section (3 stages) of a regular
round as shown in Figure 5.

Fig. 5. Pipeline of the VINCI [17] data path for one round.



3.2 Ascom IDEACrypt Coprocessor 

Ascom Systec Ltd., holder of the IDEA™ patents, offers a high-speed
implementation of the IDEA™ algorithm as an embedded ASIC core. The IDEACrypt
kernel implements data encryption and decryption in all common operating modes for
block ciphers (ECB, CBC, CFB, OFB) [6]. IDEACrypt provides flexible key management
for both master-session key and asymmetric key and stores the keys in a RAM which
may be either on-chip or off-chip. Therefore, with a RAM, a bus interface, and a
global controller, a complete IDEA™ cipher may be implemented. The entire
IDEACrypt coprocessor is implemented in synthesizable VHDL code. Ascom Systec lists
a complexity of approximately 35k gates.

With 0.25 micron technology using a 3-stage pipeline, the chip provides a
throughput of 300 Mbit/sec (@ 40 MHz) in ECB mode and a throughput of 100 Mbit/sec
in the other modes. At 100 MHz, the throughput goes up to 720 Mbit/sec in ECB mode.



3.3 Existing FPGA Implementations

Caspi and Weaver [2] proposed IDEA as a benchmark for reconfigurable computing.
Their implementation on a Xilinx XC4005 achieves a throughput of 0.477 Mbit/s. A
highly space-conserving design was used because of the limited resources of this
FPGA.

Mencer et al. [13] compared different implementations of the IDEA™ algorithm.
They compared a Digital Signal Processor (DSP) from Texas Instruments to Xilinx
XC4000 series FPGAs and pointed out the benefits and limitations of FPGA and DSP
technologies for IDEA™. The FPGA implementation has a throughput rate of 528
Mbit/s @ 33 MHz using a fully pipelined version of IDEA™ distributed on four
XC4000XL FPGAs. Taking into account the powerful programming capabilities of the
FPGA technology and the performance of the new families (e.g., Xilinx Virtex), this
kind of circuit is becoming an excellent option to implement IDEA™.



4 The First CypherCore: IDEACore Encryption Module 



The first CypherCore module, called IDEACore, implements the IDEA™
algorithm. It is composed of a scalable pipeline and an associated block chaining
module (Figure 6).




Fig. 6. The IDEACore module is composed of a scalable pipeline and a block-chaining 
module. 



4.1 A Scalable IDEA™ Pipeline

We adopted a highly scalable solution for our IDEA™ pipeline: the length of the
pipeline can be chosen at compilation time. The regular round is inspired by the VINCI 
datapath (see Figure 5). Figure 7 shows the regular round consisting of seven pipeline 
stages. The first three stages of a regular round simply form the output round (Fig. 8). 

As Figure 9 shows, the minimal pipeline length is one regular round followed by 
one output round. Data has to be fed eight times through the regular round before 
passing through the output round. The longest pipeline (full-length pipeline) consists 
of eight rounds and one output round. In each configuration, the data needs 59 clock 
cycles to pass through the pipeline. A longer pipeline has a smaller latency and thus a 
higher throughput. 

We use a fully self-controlled pipeline with a control pipeline associated in parallel 
to the data pipeline. The control pipeline addresses the key memories attached to each 
stage that needs an encryption key. Every 64-bit data block has an associated counter 
that indicates the current round. Data is automatically fed to the output stage if
it has been sent the correct number of times through the regular rounds; otherwise it is
fed back through the block of regular rounds. Pipeline bubbles (data marked by a non-valid bit)
are automatically inserted into the pipeline if the module preceding the pipeline (block
chaining) is not able to deliver new data packets. This mechanism allows us to avoid
pipeline stalls.

Multipliers and Modulo (2^16 + 1). We currently use simple bit-parallel multipliers
optimized for FPGAs and the low-high algorithm [7] for the modulo (2^16 + 1)
calculation. As stated in Section 3, the combinatorial delay and area consumption of
the multiplication modulo (2^16 + 1) units are crucial to the entire design and limit
the data path.
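
The low-high reduction can be illustrated in software. The following is a minimal sketch, assuming the usual IDEA convention that the 16-bit word 0 stands for 2^16; the function name `mul_mod_65537` is ours, and the code is an illustration rather than a transcription of the CryptoBooster VHDL.

```python
MODULUS = (1 << 16) + 1  # 2^16 + 1 = 65537, a prime


def mul_mod_65537(a: int, b: int) -> int:
    """Multiply two 16-bit IDEA words modulo 2^16 + 1 (the word 0 encodes 2^16)."""
    a = a or 0x10000
    b = b or 0x10000
    product = a * b
    low = product & 0xFFFF   # lower 16 bits of the product
    high = product >> 16     # remaining upper bits
    # Because 2^16 is congruent to -1 modulo 2^16 + 1,
    # the product is congruent to low - high.
    result = low - high
    if result < 0:
        result += MODULUS
    return result & 0xFFFF   # the value 2^16 is written back as the word 0
```

Only one subtraction and one conditional addition follow the 16 x 16-bit multiplication, which is why the multiplier itself dominates the delay of the unit.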

Bit-parallel multipliers are perhaps not the best choice, but we were surprised by the
performance they achieved and the area they used in our FPGA-based implementation.
In the near future, we intend to optimize the multiplication modulo (2^16 + 1) units to
achieve yet higher throughput.

Fig. 7. One round with associated encryption key memories of the CryptoBooster
pipeline.




Fig. 8. Output round with associated encryption key memories of the CryptoBooster
pipeline.



4.2 Block-Chaining 

The block-chaining module implements the commonly used block-chaining algorithms
ECB (Electronic Codebook Mode), CBC (Cipher Block Chaining Mode), CFB
(Cipher Feedback Mode), and OFB (Output Feedback Mode). To prevent the pipeline
from stalling, the block-chaining module always has enough initialization vectors
available (the number depends on the number of regular rounds used in the pipeline).

As with all other modules in the CryptoBooster architecture, the block-chaining 
module is connected to the other modules by CoreLink unidirectional point-to-point 
interconnections.

Fig. 9. The four different pipeline configurations: 1+1, 2+1, 4+1, and 8+1 rounds.
Data has to be fed 8, 4, 2, and 1 times through the regular rounds, respectively.



4.3 Performance of the IDEACore CryptoCore 

The CryptoBooster is designed to achieve maximum throughput for a given area in 
the FPGA. Our current implementation allows pipeline lengths of 1, 2, 4, or 8 regular 
rounds, followed by one output round. A full-length pipeline consists of 59 (8 regular 
rounds + 1 output round) stages with a latency of 1 clock cycle when using bit-parallel 
multipliers (Figure 9). 

The peak performance of our current implementation is estimated at 200 Mbit/s
for a 1-round pipeline (1 regular round + 1 output round), which easily fits into a
state-of-the-art FPGA. Performance for a full-length pipeline (8 regular rounds + 1
output round) is estimated at more than 1500 Mbit/s. However, the area needed in
terms of reconfigurable logic blocks in the FPGA is quite large.
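
The relation between pipeline length and peak throughput can be sketched with a small model. It assumes that the pipeline is kept full and that one 64-bit block leaves the pipeline every 8/r clock cycles when r regular rounds are instantiated. The 25 MHz clock used in the usage note is our assumption, chosen because it reproduces the 200 Mbit/s quoted for the 1-round pipeline; the function name is hypothetical.

```python
BLOCK_BITS = 64       # IDEA block size in bits
TOTAL_ROUNDS = 8      # regular rounds every block must traverse
REGULAR_STAGES = 7    # pipeline stages per regular round
OUTPUT_STAGES = 3     # pipeline stages of the output round


def pipeline_model(regular_rounds: int, clock_mhz: float) -> dict:
    """Idealized figures for a pipeline with 1, 2, 4 or 8 regular rounds."""
    if regular_rounds not in (1, 2, 4, 8):
        raise ValueError("supported pipeline lengths are 1, 2, 4 or 8 regular rounds")
    passes = TOTAL_ROUNDS // regular_rounds          # trips through the regular rounds
    stages = regular_rounds * REGULAR_STAGES + OUTPUT_STAGES
    cycles_per_block = TOTAL_ROUNDS * REGULAR_STAGES + OUTPUT_STAGES
    # With a full pipeline, one block completes every `passes` clock cycles.
    throughput_mbit_s = BLOCK_BITS * clock_mhz / passes
    return {
        "stages": stages,
        "passes": passes,
        "cycles_per_block": cycles_per_block,
        "throughput_mbit_s": throughput_mbit_s,
    }
```

Under these assumptions, `pipeline_model(1, 25.0)` gives 200 Mbit/s and `pipeline_model(8, 25.0)` gives 1600 Mbit/s, while every configuration needs 59 clock cycles per block.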

Session initialization and key calculation slightly decrease the overall performance 
over a complete session. Depending on the block-chaining mode used, performance may, 
however, significantly decrease. 



5 Conclusions 

The CryptoBooster is a modular and reconfigurable cryptographic coprocessor taking 
full advantage of current high-performance reconfigurable circuits (FPGAs). Reconfi- 
gurable circuits can be reconfigured within a few milliseconds and they provide speed 
rates close to ASIC designs. Our main goal is to have maximum data throughput so as 
to provide a design able to cope with ever-increasing network speed. This justifies hard- 
ware implementation in place of a software solution. Physical security may, however, 
also be an argument for hardware implementation. 

As our results show, the throughput of the CryptoBooster allows the coprocessor to 
be used in today’s high-speed networks like ATM, Sonet, and GigaEthernet; moreover, 
it is competitive with full-custom circuits or DSP implementations. IDEA™ was
chosen as a first CryptoCore module. More modules with different algorithms (e.g.,
DES) are planned.

Abstract. A compact fast elliptic curve scalar multiplier with variable
key size is implemented as a coprocessor with a Xilinx FPGA. This
implementation utilizes the internal SRAM/registers of the FPGA and
has the whole scalar multiplier implemented within a single FPGA chip.
The compact design helps reduce the overhead and limitations associated
with data transfer between FPGA and host, and thus leads to high
performance. The experimental data from the mappings over small fields
show that the carefully constructed hardware architecture is regular and
has high CLB utilization.

Keywords: elliptic curves, public-key cryptography, scalar multiplication,
Galois field, reconfigurable hardware, FPGA, coprocessor



1 Introduction 

The motivation of this work is to develop high-speed elliptic curve scalar
multipliers with the least development time, the lowest hardware cost and maximal
flexibility. Recently, elliptic curve (EC) cryptosystems have become attractive
due to their small key sizes and the variety of curves available. Their
low cost and compact size are critical to some applications, including smart
cards and hand-held devices [1]. In all those applications, an elliptic curve scalar
multiplier serves as a basic building block for secret key exchange, authentication
and certification. Most microprocessors have hardware-supported integer
multiplication and logic functions like AND, OR or XOR, so an elliptic curve
cryptosystem can be implemented on them. However, this is not efficient because
of word size mismatch, less parallel computation, no hardware-supported wire
permutation and algorithm/architecture mismatch. As a result, such systems
have low performance/cost ratios.

The solution to this problem is to build a coprocessor as a dedicated computing
unit. Moreover, using an FPGA, the coprocessor can be reconfigured for
different application instances or for different computation stages of one particular
application. Thus, the total hardware utilization can be kept at a very high
rate and the computation is sped up.

Previous work in this area is based on a coprocessor for arithmetic operations
over GF(2^155) using a gate array [11]. To accomplish an elliptic curve operation,
the host controller and the coprocessor have to transfer data between each other
frequently. The control of elliptic curve operations and the storage of intermediate
variables are provided by the host controller. Therefore, the communication
cost is large and may be a bottleneck for an elliptic curve cryptosystem.

In this paper, a compact fast elliptic curve scalar multiplier coprocessor is
introduced which utilizes the internal SRAM/registers in an FPGA and is
implemented within a single FPGA chip. Normal bases for the underlying
finite field are chosen because field squarings can be done with bit shifts
in hardware and are virtually free [2]. A pipelined digit-serial modified
Massey-Omura multiplier is constructed and is used in the design. The scalar multiplier is
implemented with a parameterized (in terms of key size) VHDL description and is
synthesized/mapped to a Xilinx FPGA. By changing the parameter for key size
and re-doing synthesis, a different instance can be acquired. The architecture
and algorithms adopted here are suitable for massively parallel computation.
Therefore, with larger capacity FPGA chips, higher performance can be easily
obtained with few changes in the underlying design.

2 Algorithm of EC Scalar Multiplier 

The basic operation in an EC cryptosystem is the scalar multiplication over the
elliptic curve, and the most efficient method for computing EC scalar
multiplications is to use an addition/subtraction method [2][4][5]. With this method, the
scalar (or the integer) is decomposed in a non-adjacent format (NAF) and one
scalar multiplication is done with a series of additions/subtractions of elliptic
curve points. In turn, each addition/subtraction of EC points consists of a series
of underlying field additions, squarings, multiplications and inversions. When
the elliptic curve is defined over GF(2^m) with an optimal normal basis, these
underlying field operations have the least complexity. The elliptic curve used in
this implementation is defined by the Weierstrass equation

$y^2 + xy = x^3 + ax^2 + b$    (1)

where $a, b \in GF(2^m)$ and $b \neq 0$. The algorithms for EC scalar multiplication and
EC addition/subtraction are shown below [2].

Algorithm 1: EC scalar multiplication
Input:
    P: EC point
    n = $(e_{l-1}, e_{l-2}, \ldots, e_1, e_0)$: integer in non-adjacent format, with $e_{l-1} = 1$
Output:
    Q = nP
Computation:
    1. Set Q = P;
    2. For i = l - 2 downto 0 do
           Set Q = 2Q;
           If $e_i = 1$ then set Q = Q + P;
           If $e_i = -1$ then set Q = Q - P;
    3. Output Q;

Algorithm 2: EC addition
Input:
    $P_0 = (x_0, y_0)$
    $P_1 = (x_1, y_1)$
Output:
    $(x_2, y_2) = P_2 = P_1 + P_0$
Computation:
    1. If $P_0 = O$, then output $P_2 = P_1$ and stop;
    2. If $P_1 = O$, then output $P_2 = P_0$ and stop;
    3. If $x_0 = x_1$ then
           If $y_0 = y_1$ then
               $\lambda = x_1 + y_1/x_1$;
               $x_2 = \lambda^2 + \lambda + a$;
               $y_2 = x_1^2 + (\lambda + 1) x_2$;
           else output O;
       else
           $\lambda = (y_0 + y_1)/(x_0 + x_1)$;
           $x_2 = \lambda^2 + \lambda + x_0 + x_1 + a$;
           $y_2 = (x_1 + x_2)\lambda + x_2 + y_1$;
    4. Output $(x_2, y_2)$;
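
Algorithms 1 and 2 can be exercised in software on a toy curve. The sketch below is only an illustration under stated assumptions: it uses GF(2^5) in polynomial basis with the field polynomial x^5 + x^2 + 1 and the curve parameters a = b = 1 (our choices, not those of the implementation, which uses an optimal normal basis), represents the point at infinity O by `None`, and adds a guard for doubling a point with x = 0, a case Algorithm 2 does not spell out. All function names are hypothetical.

```python
M = 5
POLY = 0b100101          # x^5 + x^2 + 1, irreducible over GF(2) (assumed for this sketch)
A, B = 1, 1              # curve y^2 + xy = x^3 + A x^2 + B (assumed parameters)


def gf_mul(a, b):
    """Multiply two elements of GF(2^M) in polynomial basis."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:       # degree reached M: reduce by the field polynomial
            a ^= POLY
    return result


def gf_inv(a):
    """Invert a non-zero element as a^(2^M - 2) by square-and-multiply."""
    result, base, exponent = 1, a, (1 << M) - 2
    while exponent:
        if exponent & 1:
            result = gf_mul(result, base)
        base = gf_mul(base, base)
        exponent >>= 1
    return result


def on_curve(P):
    if P is None:
        return True
    x, y = P
    x2 = gf_mul(x, x)
    return gf_mul(y, y) ^ gf_mul(x, y) == gf_mul(x2, x) ^ gf_mul(A, x2) ^ B


def ec_neg(P):
    return None if P is None else (P[0], P[0] ^ P[1])


def ec_add(P0, P1):
    """Algorithm 2; field addition is XOR in characteristic 2."""
    if P0 is None:
        return P1
    if P1 is None:
        return P0
    (x0, y0), (x1, y1) = P0, P1
    if x0 == x1:
        if y0 != y1 or x1 == 0:          # P1 = -P0, or a point of order 2
            return None
        lam = x1 ^ gf_mul(y1, gf_inv(x1))
        x2 = gf_mul(lam, lam) ^ lam ^ A
        y2 = gf_mul(x1, x1) ^ gf_mul(lam ^ 1, x2)
    else:
        lam = gf_mul(y0 ^ y1, gf_inv(x0 ^ x1))
        x2 = gf_mul(lam, lam) ^ lam ^ x0 ^ x1 ^ A
        y2 = gf_mul(x1 ^ x2, lam) ^ x2 ^ y1
    return (x2, y2)


def scalar_mul_naf(digits, P):
    """Algorithm 1; `digits` is the NAF of n, most significant first, digits[0] == 1."""
    Q = P
    for e in digits[1:]:
        Q = ec_add(Q, Q)
        if e == 1:
            Q = ec_add(Q, P)
        elif e == -1:
            Q = ec_add(Q, ec_neg(P))
    return Q
```

For example, `scalar_mul_naf([1, 0, 0, -1], P)` computes 7P with three doublings and one subtraction, since 7 = 8 - 1.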

Since $-P_0 = (x_0, x_0 + y_0)$ for $P_0 = (x_0, y_0)$ and $P_1 - P_0 = P_1 + (-P_0)$,
EC subtraction is as simple as EC addition and can be computed with one
EC addition. The average and maximal numbers of non-zero digits among NAFs
are about m/3 and m/2, respectively [2]. Therefore the average cost of Algorithm 1 is
about m point doublings and m/3 point additions/subtractions [2], and the worst-case
cost is about m point doublings and m/2 point additions/subtractions. This is
much better than binary decomposition, which has m/2 non-zero bits on average
and m non-zero bits in the worst case.
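
The NAF recoding itself is simple. The following sketch uses the standard right-to-left recoding (the text does not say how the decomposition is produced, so this choice is an assumption) and compares the average number of non-zero digits with that of the plain binary expansion; the function names are ours.

```python
def naf_digits(n: int) -> list:
    """Return the NAF digits of n > 0, least significant first, each in {-1, 0, 1}."""
    digits = []
    while n > 0:
        if n & 1:
            d = 2 - (n % 4)      # +1 if n = 1 (mod 4), -1 if n = 3 (mod 4)
            n -= d               # n is now divisible by 4, so the next digit is 0
        else:
            d = 0
        digits.append(d)
        n //= 2
    return digits


def average_weights(m: int):
    """Average non-zero digit counts (NAF, binary) over all m-bit integers."""
    low, high = 1 << (m - 1), 1 << m
    count = high - low
    naf_avg = sum(sum(1 for d in naf_digits(n) if d) for n in range(low, high)) / count
    bin_avg = sum(bin(n).count("1") for n in range(low, high)) / count
    return naf_avg, bin_avg
```

For instance, `naf_digits(7)` returns `[-1, 0, 0, 1]`, i.e. 7 = 8 - 1, which replaces the three non-zero bits of binary 111 by two non-zero digits.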

3 Hardware Architecture 

3.1 System Structure 

Since the word size (or key size) for a typical elliptic curve cryptosystem is
large, the above algorithm cannot be unfolded. Therefore, a folded hardware
architecture is constructed with a controller to sequence the computation. The
underlying field multiplier, a GF(2^m) multiplier with optimal normal basis, can
be implemented as either a serial multiplier, a digit-serial multiplier or a
parallel multiplier, depending on the amount of available hardware resources. In
Fig. 1, the two FIFOs serve as input/output buffers and the dual-port register
file is used to save input parameters and intermediate data. This is realizable
because Xilinx FPGAs have a large amount of internal SRAM and registers.
Alternatively, if we were to use external SRAM, it would take either many external
user pins with multiple SRAM chips, or multiple cycles to read in one single
word. This would result in low-bandwidth data transfers due to the large word
size. The hardware provides the GF(2^m) arithmetic units GF_adder, GF_squarer,
GF_multiplier and GF_inverter. With the finite field of characteristic 2 as the
underlying field, addition is just a bit-wise XOR, and with normal basis
representation, squaring is a simple cyclic right shift. The internal structures of the
GF_multiplier and GF_inverter are given in the following sections.
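
The claim that squaring becomes a cyclic shift in a normal basis can be checked on a small field. The sketch below is an illustration only: it works in GF(2^5) with the assumed field polynomial x^5 + x^2 + 1, searches by brute force for an element whose conjugates form a normal basis (not necessarily an optimal one), and tabulates the normal-basis coordinates of every field element. All names are hypothetical.

```python
M = 5
POLY = 0b100101  # x^5 + x^2 + 1 (assumed field polynomial for this sketch)


def gf_mul(a, b):
    """Multiply two elements of GF(2^M) in polynomial basis."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return result


def conjugates(beta):
    """Return [beta, beta^2, beta^4, ..., beta^(2^(M-1))]."""
    out = []
    for _ in range(M):
        out.append(beta)
        beta = gf_mul(beta, beta)
    return out


def coordinate_table(basis):
    """Map each reachable field element to its coordinate tuple in `basis`."""
    table = {}
    for k in range(1 << M):
        coords = tuple((k >> i) & 1 for i in range(M))
        value = 0
        for c, b in zip(coords, basis):
            if c:
                value ^= b
        table[value] = coords
    return table


def find_normal_basis_table():
    """Return the coordinate table of the first normal basis found."""
    for beta in range(1, 1 << M):
        table = coordinate_table(conjugates(beta))
        if len(table) == 1 << M:     # conjugates are linearly independent
            return table
    raise ValueError("no normal element found")


def cyclic_right_shift(coords):
    return coords[-1:] + coords[:-1]
```

With `table = find_normal_basis_table()`, the coordinates of `gf_mul(v, v)` equal `cyclic_right_shift(table[v])` for every element `v`, and the coordinates of a sum are the bit-wise XOR of the operands' coordinates.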




Fig. 1. Hardware architecture of EC coprocessor



3.2 GF_multiplier Structure

The structure of GF_multiplier is a modified form of the Massey-Omura multiplier.
Compared to the implementation in [9], the modified structure reduces
the number of AND gates and the wire permutation by 50% in the AND
Plane without changing the total number of XOR gates. A serial multiplier of
such a structure is shown in Fig. 2, which can be simply unfolded to a digit-serial
multiplier in Fig. 3 or a parallel multiplier in Fig. 4. In this serial multiplier,
at each cycle, a_shift_register and b_shift_register make a cyclic right shift,
and one bit of the product is computed and shifted into the product register,
c_shift_register. Therefore, each multiplication takes m clock cycles. If the serial
multiplier is unfolded to a parallel multiplier, then each multiplication only takes
one clock cycle. However, the required hardware will be m times that of the
serial multiplier. Pipeline techniques are also applied to the XOR Plane, AND
Plane and XOR Tree of the multiplier to reduce the clock cycle time.

The modified Massey-Omura serial multiplier takes m AND gates, 2m XOR gates and 3m
flip-flops, and has a latency of $m \times (T_{AND} + T_{XOR} \lceil \log_2(m-1) \rceil)$ when it is not
pipelined. When it is pipelined, the serial multiplier has a latency of
$(m + \lceil \log_2(m-1) \rceil) \times \max(T_{AND}, T_{XOR})$ and a total cost of m AND gates, 2m
XOR gates and 5m flip-flops. The pipelined multiplier is therefore much
faster than the non-pipelined one when m is large. The same techniques also
apply to a digit-serial multiplier. The digit-serial multiplier with a digit size of
k generates k bits of the product simultaneously, so one multiplication
takes 1/k of the time taken by a serial multiplier. The digit-serial multiplier
thus offers a trade-off between the speed and the hardware cost of a serial multiplier
and those of a parallel multiplier.

Fig. 2. Structure of GF(2^m) serial multiplier

Fig. 3. Structure of GF(2^m) digit-serial multiplier

Fig. 4. Structure of GF(2^m) parallel multiplier
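
The gate counts and latency expressions of Section 3.2 can be collected in a small cost model. This is a sketch, assuming unit gate delays unless told otherwise and assuming that a digit-serial multiplier with digit size k needs ceil(m/k) cycles; the function names are ours.

```python
def ceil_log2(x: int) -> int:
    """Smallest integer e with 2^e >= x, for x >= 1."""
    return (x - 1).bit_length()


def serial_multiplier(m: int, t_and: float = 1.0, t_xor: float = 1.0,
                      pipelined: bool = False) -> dict:
    """Resources and latency of the modified Massey-Omura serial multiplier."""
    tree_depth = ceil_log2(m - 1)
    if pipelined:
        latency = (m + tree_depth) * max(t_and, t_xor)
        flip_flops = 5 * m
    else:
        latency = m * (t_and + t_xor * tree_depth)
        flip_flops = 3 * m
    return {"and_gates": m, "xor_gates": 2 * m,
            "flip_flops": flip_flops, "latency": latency}


def digit_serial_cycles(m: int, k: int) -> int:
    """Clock cycles per multiplication for digit size k (k = 1 serial, k = m parallel)."""
    return -(-m // k)
```

For m = 39 and unit gate delays, the non-pipelined latency is 39 x 7 = 273 delay units, against 45 for the pipelined version, at the price of 2m additional flip-flops.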
3.3 GF_inverter Structure

The structure of GF_inverter is derived from the method introduced by T. Itoh
et al. [8] and is shown in Fig. 5. The inverse takes $\lfloor \log_2(m-1) \rfloor$ recursive iterations
and a total of $\lfloor \log_2(m-1) \rfloor + HW(m-1) - 2$ underlying field multiplications,
where $HW(m-1)$ is the Hamming weight of $(m-1)$.
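
The recursive structure of this inversion method can be sketched as follows. The code computes $a^{-1} = (a^{2^{m-1}-1})^2$ with an addition chain on the exponent m - 1, processing its bits from the most significant end. It is an illustration under assumptions: the field is represented in polynomial basis (so squarings are computed by multiplication rather than by shifts as in the hardware), and the two field polynomials in the usage note are our choices.

```python
def gf_mul(a, b, m, poly):
    """Multiply two elements of GF(2^m) in polynomial basis."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> m:
            a ^= poly
    return result


def gf_square_k(a, k, m, poly):
    """Square a field element k times, i.e. raise it to the power 2^k."""
    for _ in range(k):
        a = gf_mul(a, a, m, poly)
    return a


def itoh_tsujii_inverse(a, m, poly):
    """Invert a non-zero element of GF(2^m).

    Invariant: r = a^(2^k - 1). Doubling k uses r <- r^(2^k) * r;
    incrementing k uses r <- r^2 * a.
    """
    if a == 0:
        raise ZeroDivisionError("zero has no inverse")
    r, k = a, 1
    for bit in bin(m - 1)[3:]:           # bits of m - 1 after the leading one
        r = gf_mul(gf_square_k(r, k, m, poly), r, m, poly)
        k *= 2
        if bit == "1":
            r = gf_mul(gf_mul(r, r, m, poly), a, m, poly)
            k += 1
    return gf_mul(r, r, m, poly)         # a^(2^m - 2) = a^(-1)
```

A quick check is `gf_mul(a, itoh_tsujii_inverse(a, 5, 0b100101), 5, 0b100101) == 1` for any non-zero `a` in GF(2^5); the same holds in GF(2^7) with the polynomial x^7 + x + 1 (`0b10000011`).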



Input 




Fig. 5. Structure of GF(2^m) inverter



3.4 Controller Structure 

The controller takes advantage of the abundance of internal SRAM and registers 
in Xilinx FPGAs. The controller is built up as a finite state machine with table 
look-up to implement the logic functions. Since the whole look-up table consists 
of small look-up tables from each CLB (configurable logic block), the controller 
can be pipelined to have a clock cycle time equal to one CLB delay. A pipelined
structure of the controller is illustrated in Fig. 6. At each cycle, a selector
associated with the value of PC is generated. It then selects the appropriate bit of
the condition word to make PC either increment or load a new value from the
Branch PC look-up. In case of a branch hazard, the pipeline register is cleared.




Fig. 6. Pipelined structure of the controller 






Elliptic Curve Scalar Multiplier Design Using FPGAs 263 



4 Dataflow of EC Scalar Multiplication 

The algorithm and hardware architecture lead to the computation dataflow
chart shown in Fig. 7 and Fig. 8. Each digit of the decomposed scalar is encoded
by 2 bits to represent {1, 0, -1}. A total of 9 intermediate storage elements are
needed, which leads to a 4-bit address space for the dual-port register file. Then
each data bit of the dual-port register file can be implemented with one CLB
and the dual-port register file has one CLB delay. The dataflow
has many conditional branches; therefore, the branch hazard problem has to be
taken care of. From the dataflow chart, the schedule and control of computation in
the scalar multiplier are worked out and the corresponding VHDL description is
implemented.




Fig. 7. Dataflow chart I of EC scalar multiplier 



5 Development of EC Scalar Multiplier 

All components in the scalar multiplier are implemented according to the above
algorithms and hardware architectures. A pipelined digit-serial multiplier is
implemented because a serial or a parallel multiplier is only a special case for a digit
size of 1 or m, respectively. The controller is a finite state machine according to
Fig. 6. The implementation of the controller follows the dataflow charts I and II
in Fig. 7 and Fig. 8, respectively. All operations in the dataflow can be
categorized as one of the following atomic operations: unconditional jump, conditional
jump, operand load, operand store, finite field addition, finite field squaring,
finite field multiplication and finite field inversion. Then, each state of the
controller consists of one or more such atomic operations because addition,
squaring, multiplication/inversion and load/store can be executed concurrently. The
execution schedule is optimized to provide the shortest computing time. These
atomic operations are represented as macros and are re-used in the VHDL code.
One example of the VHDL simulation results is shown in Fig. 9 for m = 39 and
a type II optimal normal basis [14]. The example shows two scalar multiplications
which compute

7 × (1A0C3EB323, 2EE60CF558) = (7327E64FAD, 34A9265CE1)

7 × (1A0C3EB323, 34EA32467B) = (7327E64FAD, 418EC0133C)

for a = 1A28CE01DD and b = 1200569H44 in equation (1). The hex encoding
follows the method in [14].

Fig. 8. Dataflow chart II of EC scalar multiplier

6 Mapping onto Xilinx XC4000XL-Series FPGA 

6.1 Synthesis/Mapping Results 

After VHDL code simulations, the design is set up with a pipelined GF(2^m)
serial multiplier (i.e., a digit size of 1) and is mapped onto a Xilinx XC4000XL-series
FPGA for some small values of m. The synthesis is done with Exemplar
synthesis tools and the mapping/layout is done with Xilinx Design Manager.
The results are shown in Table 1 and one example of the layout for m = 29
is shown in Fig. 10. The mapping efficiency is represented by the percentage
of total CLBs in an FPGA that is used by the design, and the throughput is
represented by the number of scalar multiplications per second. Both the throughput
and clock cycles in Table 1 represent the worst-case performance. The results show
that the architecture of the prototype is regular and the designed coprocessor has
very high CLB utilization. The dominant operation in the EC scalar multiplier
is GF(2^m) multiplication. Therefore, if the GF(2^m) serial multiplier is unfolded
by a factor of 2, then the throughput, in terms of scalar multiplications per
second, will be doubled.

Fig. 9. VHDL simulation waveform for m = 39



Table 1. FPGA chip area utilization and throughput

Value of m   XC4000XL Device   CLB Usage          Clock Cycles   Throughput (scalar mul/sec)
5            4010XL            272/400 = 68%      126            179856
11           4013XL            478/576 = 83%      825            19230
29           4028XL            962/1024 = 93%     7158           1653
53           4044XL            1626/1936 = 84%    26753          417
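
Since the throughput is the clock frequency divided by the worst-case cycle count, the clock frequency behind each row of Table 1 can be recovered by multiplying the two columns. The sketch below does this; the table values are copied from Table 1, while the derived frequencies are our inference and are not stated in the text.

```python
# m -> (worst-case clock cycles, throughput in scalar multiplications per second)
TABLE_1 = {
    5: (126, 179856),
    11: (825, 19230),
    29: (7158, 1653),
    53: (26753, 417),
}


def implied_clock_mhz(m: int) -> float:
    """Clock frequency in MHz implied by cycles x throughput for field size m."""
    cycles, throughput = TABLE_1[m]
    return cycles * throughput / 1e6
```

Under this reading the implied clock falls from about 22.7 MHz for m = 5 to about 11.2 MHz for m = 53, so the throughput drops faster than the cycle count alone would suggest.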



6.2 Analysis of the Mapping Results 

Fig. 10. FPGA layout for m = 29 with XC4028XL

Since the proposed hardware architecture is regular and simple, the expected
mapping results can also be obtained with an estimation formula. Then, a
comparison can be made to analyze the mapping results. In order to derive the
formula, two summaries are listed below:

1. CLB (Configurable Logic Block) structure of Xilinx XC4000XL-series:

— 2 flip-flops (FFs) per CLB

— 2 function generators (FGs) per CLB (4-input/single-output logic unit)

— 2 single-port 16x1 RAMs per CLB (using two logic units)

— 1 dual-port 16x1 RAM per CLB (using two logic units)

2. Cost of implementing basic components:

— two m-bit registers take 2m FFs.

— three m-bit shift registers take 3m FFs and FGs.

— four m-bit 2:1 MUXs take 4m FGs.

— one m-bit GF_Adder takes m FGs.

— one m-bit dual-port 9-word RAM takes 2m FGs.

— one m-bit 6-word FIFO takes 6(m + 1) FFs and FGs.

— one m-bit 2-word FIFO takes 2(m + 1) FFs and FGs.

— one m-bit GF_Multiplier takes 5m FFs and 3m FGs (pipelined bit-serial
multiplier).

— one m-bit GF_Inverter takes $3(m + \log_2 m)$ FFs and FGs (excluding GF_Multiplier).

From the above basic facts, the cost of one EC scalar multiplier with key size m is
derived as:

— Total FFs = $21m + 3\log_2 m + 48$

— Total FGs = $24m + 3\log_2 m + 308$

— Minimal value of Total CLBs
= max(Total FFs, Total FGs)/2
= $12m + (3\log_2 m)/2 + 154$

— Maximal value of Total CLBs
= (Total FFs + Total FGs)
= $45m + 6\log_2 m + 356$

Table 2 is constructed with the above estimation formulas and the data in Table 1.
In Table 2, the actual CLBs means the CLB count of the mapping obtained
with Xilinx Design Manager. The Min/Max CLBs come from the estimation
formulas and put a lower/upper limit on the total number of CLBs needed
for the design. Using regression, a polynomial curve of degree 2 is fitted to
the data of the actual CLBs and the CLB usage for larger fields is predicted
through extrapolation with the fitted curve. However, routing resources are not
taken into account in the extrapolation and they could become a bottleneck for
larger fields. For m = 160, the estimated number of CLBs (about 4000) is larger than
the capacity of the XC4085XL. However, the design should easily fit onto the
Virtex XCV1000 chip.
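
The extrapolation can be reproduced approximately with an ordinary least-squares fit of a degree-2 polynomial to the four measured points. The sketch below is only indicative: the text gives neither the fitted coefficients nor the regression procedure, so the least-squares formulation and all names are assumptions. Exact rational arithmetic is used to avoid rounding issues in the small linear system.

```python
from fractions import Fraction

# (m, actual CLBs) taken from the mapping results
MEASURED = [(5, 272), (11, 478), (29, 962), (53, 1626)]


def fit_quadratic(points):
    """Least-squares coefficients (c0, c1, c2) of y = c0 + c1*x + c2*x^2."""
    xs = [Fraction(x) for x, _ in points]
    ys = [Fraction(y) for _, y in points]
    power_sums = [sum(x ** k for x in xs) for k in range(5)]
    rhs = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented matrix of the normal equations, solved by Gauss-Jordan elimination.
    rows = [[power_sums[i + j] for j in range(3)] + [rhs[i]] for i in range(3)]
    for col in range(3):
        pivot = next(r for r in range(col, 3) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(3):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [v - factor * w for v, w in zip(rows[r], rows[col])]
    return [rows[i][3] / rows[i][i] for i in range(3)]


def predict(coeffs, m):
    c0, c1, c2 = coeffs
    return float(c0 + c1 * m + c2 * m * m)
```

With `coeffs = fit_quadratic(MEASURED)`, `predict(coeffs, 160)` lands somewhat above 4000 CLBs, in line with the estimate quoted for m = 160.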




Table 2. Mapping analysis

Value of m   Expected Device   Actual CLBs   Min CLBs   Max CLBs
5            XC4010XL          272           219        600
11           XC4013XL          478           292        876
29           XC4028XL          962           510        1692
53           XC4044XL          1626          799        2778

Fig. 11. Extrapolated mapping analysis



7 Conclusions and Future Work 

The experimental results from the mappings over small fields show that the hard- 
ware architecture is regular and achieves high CLB utilization and high speed. 
The use of an FPGA in the development of an elliptic curve scalar multiplier 
demonstrates many advantages: 

— Reduced development time and cost. 

— Tailorable design for a particular application. 

— Reduced hardware overhead and high performance. 

— Increased chip area utilization. 

— Hardware performance with advantages of software development. 

— Simplified hardware architecture and ability to easily add new functions. 

The effectiveness and eventual performance/cost ratio of applying reconfigurable
hardware to cryptography depend on many factors, and research in this
area is highly experimental. Therefore, future work remains in many areas and
is summarized as follows:

— To map the design onto more types of FPGA chips to show the usefulness
of the design and to reveal the relationship between the algorithm and the
architecture/resources of FPGAs.

— To build some EC application systems, such as an EC digital signature or an
EC Diffie-Hellman key exchange [1][14][15], by using reconfigurable hardware,
such that a direct comparison can be made with other implementations using
microprocessors [12][13].

Abstract. In this article, an extremely simple and highly regular
architecture for a finite field multiplier using a redundant basis is presented,
where the redundant basis is a new basis taking advantage of the elegant
multiplicative structure of the set of primitive roots of unity over
$F_2$ that forms a basis of $F_{2^m}$ over $F_2$. The architecture has an
important feature of implementation complexity trade-off which enables the
multiplier to be implemented in a partially parallel fashion. The squaring
operation using the redundant basis is simply a permutation of the
coefficients. We also show that with the redundant basis the inversion problem
is equivalent to solving a set of linear equations with a circulant matrix.
The basis appears to be suitable for hardware implementation of elliptic
curve cryptosystems.



1 Introduction 

Efficient computations in finite fields and their architectures are very important
to many cryptosystems, e.g., elliptic curve systems. There are mainly three types
of bases over finite fields, namely, polynomial basis (PB), normal basis (NB), and
dual basis (DB) [12], which are commonly used to represent the field elements.
The main advantage of using the normal basis is that the squaring operation in
NB is simply a cyclic shift of the coordinates of the element, and thus this basis
has found application in computing exponentiations and multiplicative
inverses [10,8,1]. However, the computations of exponentiations and inverses require
not only squarings but also multiplications. Massey and Omura devised an NB
multiplier known as the Massey-Omura multiplier [13]. Alternative bit-serial
multipliers using the normal basis can be found in [5,2]. Bit-parallel NB
multipliers were proposed in [17,9]. PB and DB have also been used for
implementing bit-parallel multipliers [14,8,16,6,11,19].



C.K. Koc and C. Paar (Eds.): CHES’99, LNCS 1717, pp. 269-279, 1999.

In this article a new basis, the redundant basis (RB), is proposed. The redundant
basis takes advantage of the elegant multiplicative structure of the set of
primitive roots of unity over $F_2$ that forms a basis of $F_{2^m}$ over $F_2$. It is shown
that finite field arithmetic operations using the redundant basis have extremely
simple and highly regular structures.

Some similar work to ours using a polynomial ring basis was proposed
recently [4]. We believe that the polynomial ring basis is a subset of the redundant
basis proposed here.

The organization of this paper is as follows: The redundant basis is introduced
in Section 2. In Section 3, the multiplication operation using RB is discussed and
then architectures of bit-serial and bit-parallel multipliers are proposed. The relation
between RB and other types of bases is analyzed in Section 4. Squaring and
inverse operations using RB are discussed in Section 5 and Section 6, respectively.
Finally, a few examples are given in Section 7.



2 Redundant Basis 



Definition 1. [12] Let K be a field of characteristic p and n be a positive integer.
The splitting field of $x^n - 1$ over a field K is called the nth cyclotomic field over
K and denoted by $K^{(n)}$. The roots of $x^n - 1$ in $K^{(n)}$ are called the nth roots of
unity over K and the set of all these roots is denoted by $E^{(n)}$. Then a generator
of the cyclic group $E^{(n)}$ is called a primitive nth root of unity over K if n is not
divisible by p.

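The redundant basis uses the set of primitive nth roots of unity over $F_2$ that
forms a basis of $F_{2^m}$ over $F_2$. Let $\beta$ be a primitive nth root of unity in $F_{2^m}$ or
some extension field of $F_{2^m}$; then we have

\[ \beta \cdot \beta^i = \begin{cases} \beta^{i+1}, & i \neq n-1, \\ 1, & i = n-1. \end{cases} \]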



By adding the element 1 to the set of primitive roots of unity, we have¹
$I_1 = (1, \beta, \beta^2, \ldots, \beta^{n-1})$, which can be used as a basis in $F_{2^m}$ over $F_2$ and
is referred to as a redundant basis. Note that the base elements are in the
cyclotomic field $F_2^{(n)}$ and they may not belong to the field $F_{2^m}$. Clearly, any field
$F_{2^m}$ has a redundant basis if there is a cyclotomic field $F_2^{(n)}$ over $F_2$ that contains
$F_{2^m}$ as a subfield. Thus one redundant basis can be the set of $(2^m - 1)$st roots
of unity. To efficiently represent the field elements, the redundant basis should
be chosen such that its size is as small as possible. Now the question is: given
$F_{2^m}$, what is the smallest cyclotomic field $F_2^{(n)}$ that contains $F_{2^m}$ as a subfield?
An algorithm for computing such an n is given below.

¹ We denote a set by {···} and an ordered set by (···).

Algorithm 1 Computing the smallest cyclotomic field $F_2^{(n)}$ that includes $F_{2^m}$
as a subfield

1. Find all the factors $d_i$ of $2^m - 1$ that are greater than m and list them in
   increasing order: $d_1, d_2, \ldots, d_k = 2^m - 1$; set $i \leftarrow 1$;

2. DO WHILE ($i \le k$)
       IF $m \mid \phi(d_i)$ AND $j = m$ is the smallest integer such that $2^j \equiv 1 \bmod d_i$,
       THEN $t \leftarrow d_i$, and BREAK; ELSE $i \leftarrow i + 1$.

3. Let $n \leftarrow t$ and let h be the largest positive integer such that $t > hm$.
   IF $h > 1$ THEN
       FOR $i = 2$ TO h DO
           a) Find all the factors $d_l$ of $2^{im} - 1$ that are greater than $im$ and
              list them in increasing order: $d_1, d_2, \ldots, d_{k_i} = 2^{im} - 1$; set $l \leftarrow 1$;
           b) DO WHILE ($l \le k_i$)
                  IF $im \mid \phi(d_l)$ AND $j = im$ is the smallest integer such that
                  $2^j \equiv 1 \bmod d_l$, THEN $n \leftarrow \min\{n, d_l\}$, and BREAK; ELSE
                  $l \leftarrow l + 1$.

□ 

Since the $(2^m - 1)$st cyclotomic field has a degree of $\phi(2^m - 1)$ and contains the
field $F_{2^m}$ as a subfield, we have that m divides $\phi(2^m - 1)$, where $\phi$ is the Euler
phi function.
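
The size n of the smallest redundant basis can also be found by a direct search. The sketch below does not follow the factor-based steps of Algorithm 1; it relies instead on the fact that the nth cyclotomic field over $F_2$ is $F_{2^k}$, where k is the multiplicative order of 2 modulo n, so that it contains $F_{2^m}$ exactly when m divides k. It is a brute-force illustration with hypothetical function names.

```python
def order_of_2(n: int) -> int:
    """Multiplicative order of 2 modulo an odd integer n >= 3."""
    k, power = 1, 2 % n
    while power != 1:
        power = (power * 2) % n
        k += 1
    return k


def smallest_redundant_basis_size(m: int) -> int:
    """Smallest odd n > m such that m divides the order of 2 modulo n."""
    n = m + 1
    while True:
        if n % 2 == 1 and order_of_2(n) % m == 0:
            return n
        n += 1
```

For m = 4 the search returns n = 5 = m + 1, the case in which a type I optimal normal basis exists; for m = 5 it returns n = 11 = 2m + 1, and for m = 6 it returns n = 9.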

3 Redundant Basis Multiplier 

3.1 Multiplication Operation 

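Consider the redundant basis in $F_{2^m}$ over $F_2$: $I_1 = (1, \beta, \beta^2, \ldots, \beta^{n-1})$. Let a field
element $A \in F_{2^m}$ be represented with $I_1$:

\[ A = a_0 + a_1\beta + a_2\beta^2 + \cdots + a_{n-1}\beta^{n-1}, \qquad (1) \]

where $a_i \in F_2$, $i = 0, 1, \ldots, n-1$. Note that $n \ge m + 1$ and the set of coefficients
$\{a_i\}$ is not unique.

Now let us look at the multiplication operation under the redundant basis $I_1$.
Let $B \in F_{2^m}$ be given as $B = b_0 + b_1\beta + b_2\beta^2 + \cdots + b_{n-1}\beta^{n-1}$. Then we have

\[ \beta \cdot B = b_0\beta + b_1\beta^2 + b_2\beta^3 + \cdots + b_{n-2}\beta^{n-1} + b_{n-1}\beta^n \]
\[ \phantom{\beta \cdot B} = b_{n-1} + b_0\beta + b_1\beta^2 + b_2\beta^3 + \cdots + b_{n-2}\beta^{n-1}. \]

Obviously, the coordinates of $\beta B$ are a cyclic shift of those of B, with respect to
$I_1$. From

\[ \beta^i \cdot B = b_0\beta^i + b_1\beta^{i+1} + \cdots + b_{n-i-1}\beta^{n-1} + b_{n-i} + b_{n-i+1}\beta + \cdots + b_{n-1}\beta^{i-1} \]
\[ \phantom{\beta^i \cdot B} = b_{n-i} + b_{n-i+1}\beta + \cdots + b_{n-1}\beta^{i-1} + b_0\beta^i + b_1\beta^{i+1} + \cdots + b_{n-i-1}\beta^{n-1} \]
\[ \phantom{\beta^i \cdot B} = \sum_{j=0}^{n-1} b_{(j-i)}\beta^j, \qquad (2) \]

where $(j - i) = (j - i) \bmod n$ denotes that $j - i$ is to be reduced modulo n, we
have

\[ A \cdot B = \sum_{i=0}^{n-1} a_i(\beta^i \cdot B) = \sum_{i=0}^{n-1} a_i \sum_{j=0}^{n-1} b_{(j-i)}\beta^j = \sum_{j=0}^{n-1} \Bigl( \sum_{i=0}^{n-1} a_i b_{(j-i)} \Bigr) \beta^j. \]

If we define $A \cdot B = C = \sum_{j=0}^{n-1} c_j\beta^j$, then it follows

\[ c_j = \sum_{i=0}^{n-1} a_i b_{(j-i)}, \qquad j = 0, 1, \ldots, n-1. \qquad (3) \]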

3.2 Bit-Serial Multiplier Architecture 

Figure 1 shows the multiplier structure that realizes multiplication using the redundant basis. The coordinates of $B$ with respect to the redundant basis $I_1$ are loaded into a register of length $n$ bits whose contents can be shifted cyclically. The binary tree of $n-1$ adders in $F_2$ takes the $n$ terms $a_i b_{\langle j-i\rangle}$ as its inputs and generates one $c_j$ term as output every clock cycle. All $c_j$'s, $j = 0, 1, \ldots, n-1$, which are represented using $I_1$, are computed and obtained in $n$ clock cycles. It can be seen that $n$ AND gates, $n-1$ XOR gates and $n$ 1-bit registers are required for constructing the multiplier. The clock period should not be less than $T_A + \lceil \log_2 n \rceil T_X$, where $T_A$ and $T_X$ denote the time delays of an AND gate and an XOR gate, respectively.
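The clock-by-clock behaviour described above can be mimicked in software. The sketch below is only a behavioural model under assumed conventions: the initial register loading order and the shift direction are chosen so that cycle $j$ outputs $c_j$, and the name `bit_serial_rb_multiply` is hypothetical.

```python
def bit_serial_rb_multiply(a, b):
    """Behavioural model of the bit-serial multiplier: one c_j per clock."""
    n = len(a)
    if len(b) != n:
        raise ValueError("both operands need n coordinates")
    # Register cell i holds b[(j - i) mod n] during clock cycle j.
    reg = [b[(-i) % n] for i in range(n)]
    out = []
    for _ in range(n):
        bit = 0
        for ai, ri in zip(a, reg):   # n AND gates feeding an XOR tree
            bit ^= ai & ri
        out.append(bit)
        reg = [reg[-1]] + reg[:-1]   # cyclic shift by one position
    return out


print(bit_serial_rb_multiply([1, 1, 0], [0, 1, 1]))  # prints [1, 1, 0]
```

Each pass through the loop corresponds to one clock cycle, so a full product takes $n$ iterations, matching the cycle count stated in the text.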

Table 1 shows the complexity of the bit-serial multipliers using redundant basis and normal basis when there is a type I optimal normal basis, while Table 2 compares the complexities of the RB multiplier and the NB multiplier when there is a type II optimal normal basis or no optimal basis.



Table 1. Comparison of bit-serial multipliers using type I ONB and RB (here $n = m + 1$).

| Multiplier | #AND | #XOR | #1-bit reg. | #clk cycles | Basis |
|---|---|---|---|---|---|
| Massey-Omura [13] | $2m - 1$ | $2m - 2$ | $2m$ | $m$ | normal |
| Feng [5] | $2m - 1$ | $3m - 2$ | $3m - 2$ | $m$ | normal |
| Agnew et al. [2] | $m$ | $2m - 1$ | $3m$ | $m$ | normal |
| presented here | $m + 1$ | $m$ | $m + 1$ | $m + 1$ | redundant |



It can be seen that the bit-serial redundant basis multiplier has lower complexity only when there is a type I optimal basis. When there is a type II optimal NB or no optimal basis, the redundant basis multiplier will have a long time delay. In this case, a partially parallel architecture can be employed, which is discussed in the next section.

Table 2. Comparison of bit-serial multipliers using NB and RB (where $n = km + 1$).

| Multiplier | #AND | #XOR | #1-bit reg. | #clk cycles | Basis |
|---|---|---|---|---|---|
| Massey-Omura [13] | $C_N$ | $C_N - 1$ | $2m$ | $m$ | normal |
| Feng [5] | $2m - 2$ | $C_N + m - 1$ | $3m - 2$ | $m$ | normal |
| Agnew et al. [2] | $m$ | $C_N$ | $3m$ | $m$ | normal |
| presented here | $km + 1$ | $km$ | $km + 1$ | $km + 1$ | redundant |

In the example presented in [5], a technique of reusing partial sums was used to reduce the complexity. Thus the number of XOR gates should be not greater than $C_N + m - 1$ if a non-optimal normal basis is used.




Fig. 1. Bit serial multiplier using the redundant basis. 



3.3 Bit-Parallel Multiplier Architecture 

A parallel version of the multiplier using a redundant basis is shown in Figure 2. On the left side of the figure, inputs $\{a_i\}$ and $\{b_i\}$ are fed into $n$ blocks (Block B). The detailed structure of Block B is shown on the right side of the figure. It can be seen that $n^2$ AND gates and $n(n-1)$ XOR gates are required. The time delay is $T_A + \lceil \log_2 n \rceil T_X$.



Trade-off with complexities or partial-parallel architecture. The proposed bit-parallel architecture can easily be adapted to trade size against time complexity: if $t$ copies of Block B are used to construct a multiplier, so that $t$ of the $c_j$'s are computed and output in one clock cycle, then one multiplication operation can be completed in $\lceil n/t \rceil$ clock cycles. This feature has great significance for hardware implementation since it might be difficult to implement a full-scale bit-parallel multiplier in hardware if the field is very large.

Fig. 2. Parallelization of the bit-serial multiplier using the redundant basis.

Table 3 and Table 4 show the comparisons between bit-parallel redundant basis multipliers and bit-parallel normal basis multipliers.



Table 3. Comparison of bit-parallel multipliers using type I ONB and RB (here $n = m + 1$).

| Multiplier | #AND | #XOR | Time delay | Partial-parallel arch. |
|---|---|---|---|---|
| Hasan et al. [9] | $m^2$ | $m^2 - 1$ | $T_A + (1 + \lceil \log_2 m \rceil) T_X$ | not avail. |
| Koc and Sunar [11] | $m^2$ | $m^2 - 1$ | $T_A + (2 + \lceil \log_2 m \rceil) T_X$ | not avail. |
| New proposal | $(m + 1)^2$ | $m(m + 1)$ | $T_A + \lceil \log_2 (m + 1) \rceil T_X$ | available |



Table 4. Comparison of bit-parallel multipliers using type II ONB and RB (where $n = 2m + 1$).

| Multiplier | #AND | #XOR | Time delay | Partial-parallel arch. |
|---|---|---|---|---|
| Massey-Omura | $2m^2 - m$ | $2m^2 - m$ | $T_A + \lceil \log_2 (2m - 1) \rceil T_X$ | available |
| New proposal | | $2m^2 - m$ | $T_A + (1 + \lceil \log_2 m \rceil) T_X$ | available |



3.4 Complexity 

Clearly the complexity of the RB multipliers in $F_{2^m}$ over $F_2$ depends on the size $n$ of the cyclotomic field $F_2^{(n)}$. There seems to be no easy way to give a general relation between $n$ and $m$. In Table 5, we have computed values of $n$ for certain small values of $m$ using Algorithm 1.

Table 5. Smallest cyclotomic field $F_2^{(n)}$ that includes $F_{2^m}$ as a subfield.

| $m$ | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $n$ | 3 | 7 | 5 | 11 | 9 | 29 | 17 | 19 | 11 | 23 | 13 | 53 | 29 | 31 | 19 | 37 |




For a subset of redundant bases, namely those that can be derived from certain normal bases (optimal normal bases), the complexity can be easily determined, as discussed in the next section. Also, for a field in which there exists an equally spaced polynomial (ESP), a small value of $n$ can be found.

4 Relation/Conversion between Redundant Basis and 
Other Bases 

4.1 Redundant Basis and Normal Basis 

Some redundant bases can be easily introduced by the normal basis generated with a Gauss period, which also reveals the relation/conversion between the redundant basis and the normal basis.

Gauss period, normal basis and redundant basis. The Gauss period (GP) was discovered by Gauss and is defined as follows: Let $m, k \ge 1$ be integers such that $r = mk + 1$ is a prime, and let $q$ be a prime power with $\gcd(q, r) = 1$. Let $\mathcal{K}$ be the unique subgroup of order $k$ of the multiplicative group of $\mathbb{Z}_r = \mathbb{Z}/r\mathbb{Z}$; then for any primitive $r$th root $\beta$ of unity in $F_{q^{mk}}$, the element

$$\alpha = \sum_{a \in \mathcal{K}} \beta^a \qquad (3)$$

is called a Gauss period of type $(m, k)$ over $F_q$. It can be checked that $\alpha \in F_{q^m}$.

Gauss periods have been used to construct normal bases with low complexity [15,3]. A Gauss period of type $(m, k)$ over $F_2$ naturally introduces a normal basis $I_2 = (\alpha, \alpha^2, \ldots, \alpha^{2^{m-1}})$ in $F_{2^m}$ over $F_2$ if and only if $\gcd(e, m) = 1$, where $e$ is the index of 2 modulo $r$. Furthermore, such a normal basis has complexity at most $mk' - 1$ with $k' = k$ if $k$ is even and $k' = k + 1$ otherwise [3,18,7]. Clearly, Gauss periods of type $(m, 1)$ and $(m, 2)$ generate optimal normal bases with complexity $2m - 1$, which are usually called type-I and type-II optimal normal bases (ONB), respectively [15].

On the other hand, a redundant basis in this case can be given as $I_1 = (1, \beta, \ldots, \beta^{km})$. Consider two sets of $km$ elements in $F_{2^{km}}$: $S_1 = \{\beta^{2^i\gamma^j},\ i = 0, 1, \ldots, m-1;\ j = 0, 1, \ldots, k-1\}$ and $S_2 = \{\beta^l,\ l = 1, 2, \ldots, km\}$. For any element $\beta^{2^i\gamma^j} \in S_1$, we have $\beta^{2^i\gamma^j} = \beta^{2^i\gamma^j \bmod (mk+1)}$, thus $S_1 \subseteq S_2$. Let $G = \mathbb{Z}^*_{km+1}$; then $G = \langle 2, \gamma \rangle$. For any integer $l \in \{1, 2, \ldots, km\}$, there exist integers $i \in \{0, 1, \ldots, m-1\}$ and $j \in \{0, 1, \ldots, k-1\}$ such that $l = 2^i\gamma^j \bmod (km + 1)$. Therefore, $S_2 \subseteq S_1 \Rightarrow S_2 = S_1$.

Since

$$I_2 = (\alpha, \alpha^2, \ldots, \alpha^{2^{m-1}}) = \Big(\sum_{i=0}^{k-1} \beta^{\gamma^i},\ \sum_{i=0}^{k-1} \beta^{2\gamma^i},\ \ldots,\ \sum_{i=0}^{k-1} \beta^{2^{m-1}\gamma^i}\Big)$$

and each element in $I_2$ is a sum of $k$ elements in $S_1$, it can be seen that the elements in $S_1$ $(= S_2)$ can serve as a basis in $F_{2^m}$, which is a permutation of $(\beta, \beta^2, \ldots, \beta^{km}) = I_3$. Obviously, the redundant basis can be obtained by adding the element '1' to the basis $I_3$.

Conversion between normal basis and redundant basis. Now let us look at the conversion from the normal basis $I_2 = (\alpha, \alpha^2, \ldots, \alpha^{2^{m-1}})$ to the redundant basis $I_1$. As we have seen before, the conversion between the redundant basis $I_1$ and the basis consisting of elements from $S_1$ is simple. If $A = (a'_0, a'_1, \ldots, a'_{m-1})$ with the normal basis, then with the basis from $S_1$,

$$A = (a''_{0,0}, a''_{0,1}, \ldots, a''_{0,k-1}, \ldots, a''_{m-1,k-1}),$$

where $a''_{i,j} = a'_i$ for $j = 0, 1, \ldots, k-1$ and $i = 0, 1, \ldots, m-1$.



4.2 Redundant Basis and Polynomial Basis 

Given a basis $I$ in $F_{2^m}$, the general case of basis conversion between $I$ and the redundant basis $I_1$ may not be trivial. If $I$ is a normal basis generated with the Gauss period of type $(m, k)$, then how to obtain $I_1$ has been discussed in the last section. If $I = (1, \alpha, \ldots, \alpha^{m-1})$ is the polynomial basis, and if we know that the order of element $\alpha$ is $\mathrm{ord}(\alpha)$, then the redundant basis $I_1$ can be obtained using the following algorithm:

Algorithm 2 Computing the redundant basis from a polynomial basis $(1, \alpha, \ldots, \alpha^{m-1})$

1. Compute $n$ using Algorithm 1;

2. Compute the order of the irreducible polynomial, $\mathrm{ord}(\alpha)$;

3. Let $t = \mathrm{ord}(\alpha)/n$; then the redundant basis is given by $(1, \alpha^t, \alpha^{2t}, \ldots, \alpha^{(n-1)t})$.

□

It can be shown that for a field in which there exists an ESP, the value of $n$ is always between $m + 1$ and $2m$.

5 Squaring Operation Using Redundant Basis 

Let $(1, \beta, \beta^2, \ldots, \beta^{n-1})$ be a redundant basis for $F_{2^m}$ over $F_2$. For a field element represented in the redundant basis:

$$A = a_0 + a_1\beta + \cdots + a_{n-1}\beta^{n-1},$$

its square is given by

$$A^2 = a_0 + a_1\beta^2 + \cdots + a_{n-1}\beta^{2(n-1)}.$$

Since $\beta^n = 1$, we have that $a_j\beta^{2j} = a_j\beta^{2j-n}$ if $2j > n - 1$. It can be seen that a squaring operation using the redundant basis is equivalent to a permutation of the element coefficients.

6 Inversion with Redundant Basis

The problem of inversion in redundant basis is as follows: Given a field element

$$A = a_0 + a_1\beta + \cdots + a_{n-1}\beta^{n-1} \in F_{2^m},$$

find

$$B = b_0 + b_1\beta + \cdots + b_{n-1}\beta^{n-1} \in F_{2^m}$$

which is the inverse of $A$. Clearly, the methods proposed by Itoh and Tsujii [10] and by Agnew et al. [1] can be used. With their methods, about $\frac{3}{2}\log_2(m - 1)$ multiplications on average and $(m - 1)$ squaring operations are required. Since a squaring operation in the redundant basis is a permutation of lines and therefore free, while the multiplication can be efficiently implemented in hardware, it is expected that with this method inversion using the redundant basis can be as good as using normal basis.
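The inversion methods above lean on squaring being free in the redundant basis. As a rough illustration of the Section 5 rule, the following sketch moves coefficient $a_j$ to position $2j \bmod n$; the function name `rb_square` is an assumption, and $n$ is taken to be odd, as it is for every $n$ in Table 5.

```python
def rb_square(a):
    """Square an element in redundant-basis coordinates.

    Coefficient a[j] of beta^j moves to position (2*j) mod n, because
    (beta^j)^2 = beta^(2j) and beta^n = 1. No gates are needed in
    hardware, only a rewiring of the n lines.
    """
    n = len(a)
    if n % 2 == 0:
        raise ValueError("n is assumed to be odd")
    squared = [0] * n
    for j, bit in enumerate(a):
        squared[(2 * j) % n] = bit
    return squared


# n = 5: beta^3 squared is beta^6 = beta^1.
print(rb_square([0, 0, 0, 1, 0]))  # prints [0, 1, 0, 0, 0]
```

Because $n$ is odd, $j \mapsto 2j \bmod n$ is a bijection on $\{0, \ldots, n-1\}$, so no two coefficients collide.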

Another method for inversion is to solve a set of linear equations. From $AB = 1$, we have

$$1 = \sum_{j=0}^{n-1} a_j\beta^j \sum_{i=0}^{n-1} b_i\beta^i = \sum_{j=0}^{n-1}\sum_{i=0}^{n-1} a_j b_i \beta^{i+j} = \sum_{i=0}^{n-1} b_i \sum_{j=0}^{n-1} a_j \beta^{\langle i+j\rangle} = \sum_{j=0}^{n-1}\Big(\sum_{i=0}^{n-1} a_{\langle j-i\rangle} b_i\Big)\beta^j,$$

where $\langle x \rangle = x \bmod n$, or,

$$\begin{pmatrix} a_0 & a_{n-1} & a_{n-2} & \cdots & a_1 \\ a_1 & a_0 & a_{n-1} & \cdots & a_2 \\ a_2 & a_1 & a_0 & \cdots & a_3 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_{n-1} & a_{n-2} & a_{n-3} & \cdots & a_0 \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ \vdots \\ b_{n-1} \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}. \qquad (4)$$

The circulant matrix is always singular and the equations allow many solutions, each of which is a representation of the inverse $B$ in the redundant basis. Note that the circulant matrix is a special case of a Toeplitz matrix, and any algorithm for solving Toeplitz systems can also be used to solve (4).

7 Examples

Example 1. For the field $F_{2^{10}}$, we can compute that the smallest cyclotomic field that includes it as a subfield is $F_2^{(11)}$. Highly regular architectures for bit-serial and bit-parallel multipliers using redundant basis can be built as discussed in Section 3. Clearly, a bit-serial multiplier requires 11 AND gates, 10 XOR gates, and 11 1-bit registers. It takes 11 clock cycles to accomplish a multiplication operation. The complexities for a fully bit-parallel multiplier are: 121 AND gates, 110 XOR gates and a propagation delay of $T_A + \lceil \log_2(m + 1) \rceil T_X$.



Example 2. From Algorithm 1, we find that the smallest cyclotomic field that includes $F_{2^6}$ as a subfield is $F_2^{(9)}$. Let the redundant basis in $F_{2^6}$ over $F_2$ be given by $(1, \beta, \beta^2, \ldots, \beta^8)$, where $\beta$ is a primitive 9th root of unity in $F_{2^6}$. In fact, $\beta$ is a root of the irreducible polynomial $x^6 + x^3 + 1$. The complexities of the bit-serial redundant basis multiplier are 9 AND gates, 8 XOR gates, 9 1-bit registers and 9 clock cycles for performing a multiplication operation.



Example 3. It can be computed that the redundant basis in $F_{2^8}$ has 17 elements ($m = 8$ and $n = 17$). Then the redundant basis multipliers can be built and their complexities can be determined.
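The gate counts quoted in these examples follow directly from the Section 3 formulas. A small helper, with the hypothetical name `rb_multiplier_cost` and an assumed dictionary layout, makes the arithmetic explicit for any $n$.

```python
def rb_multiplier_cost(n):
    """Gate counts for redundant-basis multipliers with n coordinates.

    Bit-serial:   n AND, n-1 XOR, n one-bit registers, n clock cycles.
    Bit-parallel: n^2 AND, n(n-1) XOR, XOR tree depth ceil(log2 n).
    """
    serial = {"and": n, "xor": n - 1, "registers": n, "cycles": n}
    parallel = {
        "and": n * n,
        "xor": n * (n - 1),
        # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1
        "xor_levels": (n - 1).bit_length(),
    }
    return serial, parallel


serial, parallel = rb_multiplier_cost(11)   # Example 1: m = 10, n = 11
print(serial)
print(parallel)
```

For $n = 11$ this reproduces the 11/10/11/11 and 121/110 figures of Example 1, with a parallel delay of $T_A + 4T_X$; for $n = 9$ it gives the bit-serial figures of Example 2.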

8 Summary 

In this paper, we have presented redundant bases and their applications to the construction of multipliers. It has been shown that the new constructions are advantageous over other normal basis constructions when bit-parallel or partial-parallel structures are required. The comparisons have been made between redundant basis and normal basis, since the squaring operation using redundant basis is likewise a simple rewiring of lines. The inversion using the new basis has also been discussed. It can be shown that the polynomial ring basis proposed in [4] is a subset of the redundant basis.

Abstract. Bit-parallel finite field multiplication in $F_{2^m}$ using polynomial basis can be realized in two steps: polynomial multiplication and reduction modulo the irreducible polynomial. In this article, we prove that the modular polynomial reduction can be done with $(r - 1)(m - 1)$ bit additions, where $r$ is the Hamming weight of the irreducible polynomial. We also show that a bit-parallel squaring operation using polynomial basis costs not more than $\lfloor \frac{m + k - 1}{2} \rfloor$ operations if an irreducible trinomial of form $x^m + x^k + 1$ over $F_2$ is used. Consequently, it is argued that solving for the multiplicative inverse in $F_{2^m}$ using polynomial basis can be as good as using normal basis.

1 Introduction 

The increasing use of cryptographic techniques in computer and communication network systems has inspired many researchers to find fast or bit-parallel algorithms and architectures for arithmetic over finite fields of characteristic two. Besides the discrete logarithm cryptosystems over $F_{2^m}$, the elliptic curve cryptosystems, which utilize the group of points on an elliptic curve over a field, can also be realized using finite fields of characteristic two. These fields are generally used because their arithmetic is more efficient than multiprecision arithmetic for large prime fields. The elliptic curve cryptosystems also have the advantage of high cryptographic strength relative to the key size, and thus they are especially attractive in applications such as the financial industry, smart cards and wireless areas where power and bandwidth are limited.

There are generally three types of basis in finite fields, namely, normal basis (NB), polynomial basis (PB) and dual basis (DB). Normal basis is often chosen in cryptographic applications, since the squaring operation is only a cyclic shift of the lines and thus inversion and exponentiation can be efficiently performed. Massey and Omura first found a regular architecture for normal basis multiplication [13], while the use of the optimal normal basis further reduces the complexity of multiplication [16]. Polynomial basis has long been used for finite field arithmetic. Polynomial basis multiplication based on the irreducible trinomial $x^m + x^k + 1$ with $1 \le k \le \frac{m}{2}$ is attractive because it requires fewer bit operations for modular reduction. Mastrovito has proposed a bit-parallel multiplication algorithm and architecture when $f(x)$ is a trinomial [14]. He has shown that the number of both bit multiplications and bit additions needed is proportional to $2m^2$ when the degree of $f(x)$ is no greater than 15 and not equal to 8. The Karatsuba-Ofman algorithm (KOA) has also been considered for building bit-parallel finite field multipliers [1,17]. An implementation of KOA [17] has shown that bit-parallel multiplication architectures in certain composite fields can have significantly lower complexity compared to those proposed in [14]. However, the time delay of the architectures using the KOA can be longer. Polynomial dual basis and normal dual basis have also been considered for efficient multiplication [21,19].

C.K. Koc and C. Paar (Eds.): CHES'99, LNCS 1717, pp. 280-291, 1999.
(c) Springer-Verlag Berlin Heidelberg 1999

In this article, we prove that bit-parallel reduction modulo the irreducible polynomial costs only $(r - 1)(m - 1)$ bit additions when the irreducible polynomial $f(x)$ has Hamming weight $r$. Consequent work has shown that a bit-parallel multiplier in $F_{2^m}$ over $F_2$ can be built with at most $m^2$ AND gates and $m^2 - 1$ XOR gates for any integer $m$ for which an irreducible trinomial of degree $m$ exists. Polynomial basis bit-parallel squaring is also discussed. When the irreducible polynomial is chosen as a trinomial of form $x^m + x^k + 1$, the bit-parallel squaring operation can be realized with no more than $\lfloor \frac{m + k - 1}{2} \rfloor$ bit additions. Consequently, it is argued that solving for the multiplicative inverse in $F_{2^m}$ using polynomial basis can be as good as using normal basis.

The organization of this paper is as follows: Polynomial basis bit-parallel 
multiplication and squaring are discussed in Section 2 and Section 3, respectively. 
In Section 4, we argue that to solve multiplicative inverse using polynomial basis 
can be as good as using normal basis. Finally, a few concluding remarks are given 
in Section 5. 

2 Bit-Parallel Polynomial Basis Multiplication in $F_{2^m}$

Let the finite field $F_{2^m}$ be generated with an irreducible $r$-term polynomial $f(x) = x^m + \sum_{i=0}^{r-2} x^{e_i}$, where $0 = e_0 < e_1 < \cdots < e_{r-2} < m$. Let $A(x) = \sum_{i=0}^{m-1} a_i x^i$ and $B(x) = \sum_{i=0}^{m-1} b_i x^i$ be any two elements in $F_{2^m}$. Then $C(x) = \sum_{i=0}^{m-1} c_i x^i \in F_{2^m}$, the product of $A(x)$ and $B(x)$, can be obtained in two steps:

1. Polynomial multiplication:

$$S(x) = A(x)B(x),$$

where $S(x) = \sum_{k=0}^{2m-2} s_k x^k$, and $s_k$ is given by

$$s_k = \sum_{\substack{i+j=k \\ 0 \le i,j \le m-1}} a_i b_j, \qquad k = 0, 1, 2, \ldots, 2m - 2. \qquad (1)$$

2. Reduction modulo the irreducible polynomial:

$$C(x) = S(x) \bmod f(x), \qquad (2)$$

where $C(x) = \sum_{i=0}^{m-1} c_i x^i$, $c_i \in F_2$.

Obviously, the complexities of polynomial basis bit-parallel multiplication in $F_{2^m}$ are determined by these two parts. The complexity of the first step (polynomial multiplication) is independent of the choice of the irreducible polynomial $f(x)$, and it has been shown to be $O(m \log m \log\log m)$ in bit operations [18]. We will show that the second step (modular reduction) requires at most $(r - 1)(m - 1)$ bit operations, where $r$ is the Hamming weight of the irreducible polynomial $f(x)$.

2.1 Polynomial Multiplication 

In the first step of PB multiplication (1), if $S(x)$ is computed from $A(x)$ and $B(x)$ by the conventional polynomial multiplication method, it requires $m^2$ multiplications and $(m - 1)^2$ additions in the ground field, and the time delay is $T_A + \lceil \log_2 m \rceil T_X$. However, there are some asymptotically faster methods for polynomial multiplication over finite fields [3], such as the Fast Fourier Transform method [3,11] and the Karatsuba-Ofman algorithm [10,1,17]. They can result in asymptotically fewer bit operations at the expense of longer time delay and/or certain costly pre- and post-computations. Another technique for polynomial basis multiplication that can combine polynomial multiplication with modulo reduction into one single step is called the Montgomery method [15,12].
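As a rough sketch of the conventional method, the code below computes the coefficients $s_k$ of equation (1) and reports the gate counts of a fully parallel realization. The name `poly_mul_gf2` is an assumption; the XOR count uses the observation that $m^2$ partial products are summed into $2m - 1$ outputs, which leaves $m^2 - (2m - 1) = (m - 1)^2$ additions.

```python
def poly_mul_gf2(a, b):
    """Schoolbook product of two degree-(m-1) polynomials over F_2.

    a[i], b[i] are the coefficients of x^i. Returns (s, and_gates,
    xor_gates), where s has 2m-1 coefficients.
    """
    m = len(a)
    if len(b) != m:
        raise ValueError("both operands need m coefficients")
    s = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            s[i + j] ^= a[i] & b[j]      # one AND per partial product
    and_gates = m * m
    xor_gates = m * m - (2 * m - 1)      # equals (m - 1)^2
    return s, and_gates, xor_gates


# (1 + x)(1 + x) = 1 + x^2 over F_2
print(poly_mul_gf2([1, 1], [1, 1]))  # prints ([1, 0, 1], 4, 1)
```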

2.2 Reduction Modulo a Polynomial 

For modular reduction $C(x) = S(x) \bmod f(x)$, where $\deg f = m$, $\deg S \le 2m - 2$ and $\deg C \le m - 1$, if the conventional polynomial division method is used, the complexity is $O(m^2)$ in ground field operations. Mastrovito [14] has found that if the irreducible polynomial is chosen properly for $m \le 15$, $m \ne 8$, the complexity of modulo reduction can be greatly reduced by using some partial sums. Paar [17] has also discussed this issue for certain small values of $m$. However, their methods are based on computer-based exhaustive search and are available only for moderately small fields. In the following, we will present a new algorithm that can perform modulo reduction in $(r - 1)(m - 1)$ ground field operations for any irreducible polynomial $f(x)$ with Hamming weight $r$.

Theorem 1. If the Hamming weight of the irreducible polynomial $f(x)$ is $r$, then the modular polynomial reduction (2) can be done with $(r - 1)(m - 1)$ bit operations.

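Proof: Define

$$\sum_{i=0}^{m+l} s_i x^i \bmod f(x) = \sum_{i=0}^{m-1} t_i^{(l)} x^i, \qquad l = -1, 0, 1, \ldots, m - 2. \qquad (3)$$

We have the initial values $t_i^{(-1)} = s_i$, and we try to solve for the 'final' values $t_i^{(m-2)} = c_i$, $i = 0, 1, \ldots, m - 1$.

In the following, we shall prove by induction that the complexity of solving for $t_i^{(m-2)}$, $i = 0, 1, \ldots, m - 1$, is $(r - 1)(m - 1)$ bit operations.

When $l = 0$, from (3) we have

$$\sum_{i=0}^{m-1} t_i^{(0)} x^i = \sum_{i=0}^{m} s_i x^i \bmod f(x) = \sum_{i=0}^{m-1} s_i x^i + s_m x^m \bmod f(x) = \sum_{i=0}^{m-1} t_i^{(-1)} x^i + s_m\,[1 + x^{e_1} + \cdots + x^{e_{r-2}}].$$

Clearly,

$$t_i^{(0)} = \begin{cases} t_i^{(-1)} + s_m, & \text{if } i = 0, e_1, e_2, \ldots, e_{r-2}, \\ t_i^{(-1)}, & \text{if } 0 < i \le m - 1 \text{ and } i \ne e_1, e_2, \ldots, e_{r-2}. \end{cases}$$

It can be seen that $r - 1$ bit additions are required for obtaining $t_i^{(0)}$ from $t_i^{(-1)}$, $i = 0, 1, \ldots, m - 1$.

Assume that when $0 < l < l'$, $r - 1$ bit additions are required for obtaining $t_i^{(l)}$ from $t_i^{(l-1)}$, $i = 0, 1, \ldots, m - 1$. Then, when $l = l'$, we have

$$\sum_{i=0}^{m-1} t_i^{(l')} x^i = \sum_{i=0}^{m+l'} s_i x^i \bmod f(x) = \sum_{i=0}^{m+l'-1} s_i x^i \bmod f(x) + s_{m+l'} x^{m+l'} \bmod f(x)$$
$$= \sum_{i=0}^{m-1} t_i^{(l'-1)} x^i + s_{m+l'} x^{l'}\,[1 + x^{e_1} + \cdots + x^{e_{r-2}}] \bmod f(x).$$

If $m > l' + e_{r-2}$, then

$$t_i^{(l')} = \begin{cases} t_i^{(l'-1)} + s_{m+l'}, & \text{if } i = l', l' + e_1, \ldots, l' + e_{r-2}, \\ t_i^{(l'-1)}, & \text{otherwise.} \end{cases}$$

Obviously, $t_i^{(l')}$ can be computed from $t_i^{(l'-1)}$ using $r - 1$ bit additions. Now suppose that $l' + e_{r'-1} < m \le l' + e_{r'}$, $r' \in \{1, \ldots, r - 2\}$; thus it follows

$$\sum_{i=0}^{m-1} t_i^{(l')} x^i = \sum_{i=0}^{m-1} t_i^{(l'-1)} x^i + s_{m+l'} x^{l'}\,[1 + x^{e_1} + \cdots + x^{e_{r'-1}}] + s_{m+l'} x^{l'}\,[x^{e_{r'}} + x^{e_{r'+1}} + \cdots + x^{e_{r-2}}] \bmod f(x)$$
$$= \sum_{i=0}^{m-1} t_i^{(l',0)} x^i + s_{m+l'} x^{l'}\,[x^{e_{r'}} + x^{e_{r'+1}} + \cdots + x^{e_{r-2}}] \bmod f(x)$$
$$= \sum_{i=0}^{m-1} t_i^{(l',0)} x^i + s_{m+l'} x^{l'+e_{r'}} + \cdots + s_{m+l'} x^{l'+e_{r-2}} \bmod f(x), \qquad (4)$$

where $\sum_{i=0}^{m-1} t_i^{(l',0)} x^i \triangleq \sum_{i=0}^{m-1} t_i^{(l'-1)} x^i + s_{m+l'} x^{l'}\,[1 + x^{e_1} + \cdots + x^{e_{r'-1}}]$. Since we have

$$t_i^{(l',0)} = \begin{cases} t_i^{(l'-1)} + s_{m+l'}, & \text{if } i = l', l' + e_1, \ldots, l' + e_{r'-1}, \\ t_i^{(l'-1)}, & \text{otherwise,} \end{cases}$$

it can be seen that $r'$ bit additions are required to obtain $t_i^{(l',0)}$ from $t_i^{(l'-1)}$, $i = 0, 1, \ldots, m - 1$.

In the following we shall prove that $t_i^{(l')}$, $i = 0, 1, \ldots, m - 1$, can be obtained from $t_i^{(l',0)}$ with $r - r' - 1$ bit additions. Define

$$\sum_{i=0}^{m-1} t_i^{(l',1)} x^i = \sum_{i=0}^{m-1} t_i^{(l',0)} x^i + s_{m+l'} x^{l'+e_{r'}} \bmod f(x). \qquad (5)$$

Since $0 \le l' + e_{r'} - m < l'$, we have

$$\sum_{i=0}^{m-1} t_i^{(l'+e_{r'}-m)} x^i = \sum_{i=0}^{m-1} t_i^{(l'+e_{r'}-m-1)} x^i + s_{l'+e_{r'}} x^{l'+e_{r'}} \bmod f(x). \qquad (6)$$

Since $t_i^{(l'+e_{r'}-m)}$ has been obtained from $t_i^{(l'+e_{r'}-m-1)}$ with $r - 1$ bit additions as assumed, comparing (5) to (6), we can see that (5) and (6) can be combined together to save bit operations. That is, when $l = l' + e_{r'} - m$, instead of performing (6), we perform

$$\sum_{i=0}^{m-1} t_i^{(l'+e_{r'}-m-1)} x^i + (s_{l'+e_{r'}} + s_{m+l'})\, x^{l'+e_{r'}} \bmod f(x) = \sum_{i=0}^{m-1} t_i^{(l'+e_{r'}-m,\,*)} x^i \qquad (7)$$

with $r$ bit additions, while (5) can be saved. In the sense of the count of bit operations, we may equivalently say that (5) requires one bit addition, while (6) still needs $r - 1$ bit operations. Similar arguments can be applied to the remaining $r - r' - 2$ terms $s_{m+l'} x^{l'+e_j}$, $j = r' + 1, r' + 2, \ldots, r - 2$, in (4). Thus for $l = l'$, $r - 1$ bit additions are required for obtaining $t_i^{(l')}$ from $t_i^{(l'-1)}$ for $i = 0, 1, \ldots, m - 1$. Therefore, computing $t_i^{(l)}$ from $t_i^{(l-1)}$, $i = 0, 1, \ldots, m - 1$, needs $r - 1$ bit additions for any integer $l$. We conclude that computing $c_i = t_i^{(m-2)}$, $i = 0, 1, \ldots, m - 1$, from $s_i$, $i = 0, 1, \ldots, 2m - 2$, requires $(m - 1)(r - 1)$ bit additions. □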
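The count in Theorem 1 can be checked numerically with a simple reduction routine. The sketch below is not the induction order used in the proof: it folds the high coefficients from the top down, which is a more familiar schedule that happens to use the same $(r - 1)(m - 1)$ additions. The function name `reduce_mod_sparse` and its argument layout are assumptions.

```python
def reduce_mod_sparse(s, m, low_exponents):
    """Reduce S(x) modulo f(x) = x^m + sum of x^e for e in low_exponents.

    s holds the 2m-1 coefficients of S(x); low_exponents lists
    e_0 = 0 < e_1 < ... < e_{r-2} < m. Returns (c, xor_count), where
    xor_count is the number of XOR gates a parallel circuit would wire.
    """
    if len(s) != 2 * m - 1:
        raise ValueError("S(x) must have 2m-1 coefficients")
    t = list(s)
    xor_count = 0
    for i in range(2 * m - 2, m - 1, -1):    # fold x^i for i = 2m-2 .. m
        for e in low_exponents:
            t[i - m + e] ^= t[i]             # x^i = x^(i-m) * (sum of x^e)
            xor_count += 1
        t[i] = 0
    return t[:m], xor_count


# x^4 mod (x^3 + x + 1) = x^2 + x; here r = 3, m = 3, so 2 * 2 = 4 XORs.
print(reduce_mod_sparse([0, 0, 0, 0, 1], 3, [0, 1]))  # prints ([0, 1, 1], 4)
```

Each of the $m - 1$ high coefficients is added into $r - 1$ lower positions, which is where the product $(r - 1)(m - 1)$ comes from.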

Theorem 1 can be easily extended to $F_{q^m}$, as stated in Theorem 2. The proof of Theorem 2 is analogous to that of Theorem 1.

Theorem 2. If the monic irreducible polynomial $f(x) \in F_q[x]$ of degree $m$ has Hamming weight $r$, then the modular polynomial reduction in polynomial basis multiplication can be done with $(r - 1)(m - 1)$ multiplications and $(r - 1)(m - 1)$ additions in $F_q$.



If the conventional method for polynomial multiplication is used, some results of consequent work on finite field multiplier architecture are as follows:

If the finite field $F_{2^m}$ is generated with an irreducible trinomial $f(x) = 1 + x^k + x^m$, $1 \le k \le \frac{m}{2}$, then a bit-parallel polynomial basis multiplier can be constructed with $C_{SA} = m^2$ and

(i) $C_{SX} = m^2 - 1$ and $C_T = T_A + (\lceil \log_2 m \rceil + 1)\,T_X$ for $k = 1$;

(ii) $C_{SX} = m^2 - 1$ and $C_T = T_A + (\lceil \log_2 m \rceil + 2)\,T_X$ for $1 < k < \frac{m}{2}$;

(iii) $C_{SX} = m^2 - \frac{m}{2}$ and $C_T = T_A + (\lceil \log_2 m \rceil + 1)\,T_X$ for $k = \frac{m}{2}$.



3 Polynomial Basis Bit-Parallel Squaring 



3.1 Complexity of Polynomial Basis Bit-Parallel Squaring in $F_{2^m}$

Let $f(x)$ be the irreducible polynomial over $F_2$ generating the field $F_{2^m}$. Let $A(x) = \sum_{i=0}^{m-1} a_i x^i$ be the polynomial representation of an arbitrary element of $F_{2^m}$. The squaring operation of $A(x)$ is

$$C(x) = \sum_{i=0}^{m-1} c_i x^i = A^2(x) \bmod f(x) = a_0 + a_1 x^2 + a_2 x^4 + \cdots + a_i x^{2i} + \cdots + a_{m-1} x^{2m-2} \bmod f(x).$$

It can be seen that squaring in $F_{2^m}$ is actually a case of the polynomial modular reduction that has been discussed in the last section, where the degree of each squared term in $A^2(x)$ is an even integer between 0 and $2m - 2$. From the discussion in the last section, the following corollary is obvious.



Corollary 1. Let the field $F_{2^m}$ be generated with the irreducible $r$-term polynomial $f(x)$ of degree $m$. Then squaring a field element in parallel can be performed with at most $(r - 1)(m - 1)$ addition operations in $F_2$.

When $f(x)$ is an irreducible trinomial, however, both the size complexity and time complexity can be further reduced.

Theorem 3. Let the field $F_{2^m}$ be generated with the irreducible trinomial $f(x) = x^m + x^k + 1$, where $m$ is even and $k$ odd. Then squaring a field element in a bit-parallel fashion can be performed with at most $\frac{m + k - 1}{2}$ bit operations.

Proof: Let

$$A^2(x) = \sum_{i=0}^{m-1} a_i x^{2i} = \sum_{i=0}^{2m-2} a'_i x^i,$$

where $a'_i = a_{i/2}$ if $i$ is even, and $a'_i = 0$ if $i$ is odd. Define

$$\sum_{i=0}^{m+2l} a'_i x^i \bmod f(x) = \sum_{i=0}^{m-1} t_i^{(l)} x^i, \qquad l = -1, 0, 1, \ldots, \frac{m}{2} - 1.$$

The terms $t_i^{(l)}$ have their initial values $t_i^{(-1)} = a'_i$, and we try to solve for the final values $t_i^{(m/2-1)} = c_i$, $i = 0, 1, \ldots, m - 1$.

When $l = 0$,

$$\sum_{i=0}^{m-1} t_i^{(0)} x^i = \sum_{i=0}^{m} a'_i x^i \bmod f(x) = \sum_{i=0}^{m-1} a'_i x^i + a'_m x^m \bmod f(x) = \sum_{i=0}^{m-1} t_i^{(-1)} x^i + a'_m (x^k + 1),$$

so that

$$t_i^{(0)} = \begin{cases} t_i^{(-1)} + a'_m, & i = 0; \\ a'_m, & i = k; \\ t_i^{(-1)}, & i \ne 0,\ i \text{ even}; \\ 0, & i \text{ odd},\ i \ne k. \end{cases}$$

Clearly, one bit addition is needed to compute $t_i^{(0)}$ from $t_i^{(-1)}$, $i = 0, 1, \ldots, m - 1$.

For $l > 0$, we have

$$\sum_{i=0}^{m-1} t_i^{(l)} x^i = \sum_{i=0}^{m+2l} a'_i x^i \bmod f(x) = \sum_{i=0}^{m+2(l-1)} a'_i x^i \bmod f(x) + a'_{m+2l} x^{m+2l} \bmod f(x)$$
$$= \sum_{i=0}^{m-1} t_i^{(l-1)} x^i + a'_{m+2l}\, x^{2l}(1 + x^k) \bmod f(x) = \sum_{i=0}^{m-1} t_i^{(l-1)} x^i + a'_{m+2l} x^{2l} + a'_{m+2l} x^{k+2l} \bmod f(x).$$

If $k + 2l < m$, or $l < \frac{m - k}{2}$, then

$$t_i^{(l)} = \begin{cases} t_i^{(l-1)} + a'_{m+2l}, & i = 2l; \\ a'_{m+2l}, & i = k + 2l; \\ t_i^{(l-1)}, & 0 \le i \le m - 1,\ i \ne 2l,\ i \text{ even}; \text{ or } i = k, k + 2, \ldots, k + 2(l - 1); \\ 0, & i \text{ odd},\ i \ne k, k + 2, \ldots, k + 2l. \end{cases} \qquad (8)$$

It can be seen that only one bit addition is required to compute $t_i^{(l)}$ from $t_i^{(l-1)}$ for $0 < l < \frac{m - k}{2}$ and $i = 0, 1, \ldots, m - 1$.

In the following we proceed with induction. When $l = \frac{m - k + 1}{2}$ ($m - k$ is odd) and $l < \frac{m}{2}$, we have

$$\sum_{i=0}^{m-1} t_i^{(l)} x^i = \sum_{i=0}^{m-1} t_i^{(l-1)} x^i + a'_{m+2l} x^{2l} + a'_{m+2l} x^{m+1} \bmod f(x)$$
$$= \sum_{i=0}^{m-1} t_i^{(l-1)} x^i + a'_{m+2l} x^{2l} + a'_{m+2l} x^{k+1} + a'_{m+2l} x.$$

Then,

$$t_i^{(l)} = \begin{cases} t_i^{(l-1)} + a'_{m+2l}, & i = 2l \text{ or } i = k + 1; \\ a'_{m+2l}, & i = 1; \\ t_i^{(l-1)}, & i \text{ even},\ i \ne 2l, k + 1; \text{ or } i = k, k + 2, \ldots, k + 2(l - 1); \\ 0, & i \text{ odd},\ i \ne k, k + 2, \ldots, k + 2(l - 1) \text{ and } i \ne 1. \end{cases} \qquad (9)$$

Obviously, two bit additions are required to compute $t_i^{(l)}$ from $t_i^{(l-1)}$, $i = 0, 1, \ldots, m - 1$.

Assume that for $\frac{m - k + 1}{2} < l < l'$, (9) holds; then for $l = l' < \frac{m}{2}$ we have

$$\sum_{i=0}^{m-1} t_i^{(l')} x^i = \sum_{i=0}^{m-1} t_i^{(l'-1)} x^i + a'_{m+2l'} x^{m+2l'} \bmod f(x)$$
$$= \sum_{i=0}^{m-1} t_i^{(l'-1)} x^i + a'_{m+2l'}\, x^{2l'}(1 + x^k) \bmod f(x)$$
$$= \sum_{i=0}^{m-1} t_i^{(l'-1)} x^i + a'_{m+2l'} x^{2l'} + a'_{m+2l'} x^{k+2l'-m} + a'_{m+2l'} x^{2k+2l'-m} \bmod f(x)$$
$$= \sum_{i=0}^{m-1} t_i^{(l',0)} x^i + a'_{m+2l'} x^{2k+2l'-m} \bmod f(x), \qquad (10)$$

where $t_i^{(l',0)}$ is defined by

$$\sum_{i=0}^{m-1} t_i^{(l',0)} x^i \triangleq \sum_{i=0}^{m-1} t_i^{(l'-1)} x^i + a'_{m+2l'} x^{2l'} + a'_{m+2l'} x^{k+2l'-m}.$$

Since $2l' < m$, and $k + 2l' - m$ is odd and less than $k$,

$$t_i^{(l',0)} = \begin{cases} t_i^{(l'-1)} + a'_{m+2l'}, & i = 2l'; \\ a'_{m+2l'}, & i = k + 2l' - m; \\ t_i^{(l'-1)}, & 0 \le i \le m - 1,\ i \ne 2l',\ i \text{ even}; \text{ or } i = k, k + 2, \ldots, k + 2(l' - 1); \\ & \text{or } i = 1, 3, \ldots, k + 2(l' - 1) - m; \\ 0, & i \text{ odd},\ i \ne k, k + 2, \ldots, k + 2(l' - 1),\ i \ne 1, 3, \ldots, k + 2(l' - 1) - m. \end{cases} \qquad (11)$$

Thus it requires one bit addition to obtain $t_i^{(l',0)}$ from $t_i^{(l'-1)}$, $i = 0, 1, \ldots, m - 1$.

When $2k + 2l' - m < m$, we have $t_i^{(l')} = t_i^{(l',0)}$ if $i \ne 2k + 2l' - m$, otherwise $t_i^{(l')} = t_i^{(l',0)} + a'_{m+2l'}$; therefore two bit operations are required to compute $t_i^{(l')}$ from $t_i^{(l'-1)}$ for $i = 0, \ldots, m - 1$.

When $2k + 2l' - m \ge m$, consider

$$\sum_{i=0}^{m-1} t_i^{(k+l'-m)} x^i = \sum_{i=0}^{m-1} t_i^{(k+l'-m-1)} x^i + a'_{2k+2l'-m} x^{2k+2l'-m} \bmod f(x). \qquad (12)$$

It can be seen that the last terms on the right-hand sides of (10) and (12) are the same except for the coefficient. At the step $l = k + l' - m$, instead of performing (12), if we perform

$$\sum_{i=0}^{m-1} t_i^{(k+l'-m,\,*)} x^i = \sum_{i=0}^{m-1} t_i^{(k+l'-m-1)} x^i + (a'_{2k+2l'-m} + a'_{m+2l'})\, x^{2k+2l'-m} \bmod f(x) \qquad (13)$$

at the cost of one more bit operation, then at step $l = l'$, the term $t_i^{(l')}$ can be computed from $t_i^{(l'-1)}$, $i = 0, 1, \ldots, m - 1$, with only one bit operation. Equivalently, we might say that at step $l = l'$, the term $t_i^{(l')}$ can be computed from $t_i^{(l'-1)}$, $i = 0, 1, \ldots, m - 1$, at the cost of two bit operations. Thus for $\frac{m - k + 1}{2} \le l \le \frac{m}{2} - 1$, it requires two bit additions at each step.

We conclude that the total cost for computing $c_i = t_i^{(m/2-1)}$ from $t_i^{(-1)} = a'_i$, $i = 0, 1, \ldots, m - 1$, is

$$\frac{m - k + 1}{2} + 2\Big(\frac{m}{2} - 1 - \frac{m - k - 1}{2}\Big) = \frac{m + k - 1}{2}$$

operations. □
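To get a feel for the bound in Theorem 3, one can tabulate which input bits feed each output bit of the squarer. The sketch below does this naively: it reduces each $x^{2i}$ separately and counts one XOR per extra input on every output, without reusing partial sums as the proof does. The naive count can therefore exceed the theorem's bound for larger $k$; for $k = 1$ the two agree in the cases tried. The function names are assumptions.

```python
def xpow_mod_trinomial(e, m, k):
    """Return x^e mod (x^m + x^k + 1) as an integer bitmask (bit j = x^j)."""
    f = (1 << m) | (1 << k) | 1
    r = 1
    for _ in range(e):
        r <<= 1                 # multiply by x
        if (r >> m) & 1:
            r ^= f              # replace x^m by x^k + 1
    return r


def squaring_network(m, k):
    """List, for every output c_j, the input indices i whose a_i is XORed in.

    Returns (rows, xor_count), where xor_count is the naive number of
    2-input XOR gates with no sharing of partial sums.
    """
    cols = [xpow_mod_trinomial(2 * i, m, k) for i in range(m)]
    rows = [[i for i in range(m) if (cols[i] >> j) & 1] for j in range(m)]
    xor_count = sum(max(len(row) - 1, 0) for row in rows)
    return rows, xor_count


# f(x) = x^4 + x + 1 (m even, k odd): bound (m + k - 1) / 2 = 2.
print(squaring_network(4, 1))  # prints ([[0, 2], [2], [1, 3], [3]], 2)
```

The output reads as $c_0 = a_0 + a_2$, $c_1 = a_2$, $c_2 = a_1 + a_3$, $c_3 = a_3$, which uses two additions in total.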

Theorem 4. Let the field $F_{2^m}$ be generated with the irreducible trinomial $f(x) = x^m + x^k + 1$, where $m$ is odd and $k$ even. Then bit-parallel squaring in $F_{2^m}$ can be performed with at most $\frac{m + k - 1}{2}$ bit additions. □

Theorem 5. Let the field $F_{2^m}$ be generated with the irreducible trinomial $f(x) = x^m + x^k + 1$, where both $m$ and $k$ are odd. Then bit-parallel squaring in $F_{2^m}$ can be performed with at most $\frac{m - 1}{2}$ bit additions. □

Proofs of Theorems 4 and 5 are similar to that of Theorem 3.

Some results of our consequent work on the implementation of the bit-parallel squaring operation are given below.

Theorem 6. Let the field $F_{2^m}$ be generated with the irreducible trinomial $f(x) = x^m + x^k + 1$, where $m + k$ is odd. Then a bit-parallel squarer can be implemented with at most $\frac{m + k - 1}{2}$ XOR gates. For $k = 1$ or 2, the incurred time delay is $T_X$, and for $2 < k < \frac{m}{2}$ it is $2T_X$.

Theorem 7. Let the field $F_{2^m}$ be generated with the irreducible trinomial $f(x) = x^m + x^k + 1$, where both $m$ and $k$ are odd. Then a bit-parallel squarer can be implemented with at most $\frac{m - 1}{2}$ XOR gates. The incurred time delay is $T_X$ if $k = 1$, and $2T_X$ if $2 < k < \frac{m}{2}$.

4 Inversion 

Inversion is required in elliptic curve cryptosystems when computing point multiples. This operation is usually performed with one of two methods. One is the extended Euclidean algorithm and the other is to exponentiate the element using the following identity

$$x^{-1} = x^{2^m - 2}. \qquad (14)$$

The extended Euclidean algorithm usually requires the field element to have a polynomial basis representation [5], where the most used operations are field addition, shifting and loading. Efficient algorithms have been proposed for the second method, for example, [8] and [2]. Both algorithms use about $\frac{3}{2}\log_2(m - 1)$ field multiplications on average¹ and $m - 1$ squaring operations. It has been generally accepted that normal basis should be used for this method since squaring in normal basis is only a cyclic shift of the coefficients [8,2]. However, with the results presented in this paper, we argue that solving for the inverse using (14) with a polynomial basis representation can be as efficient as with a normal basis representation.
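The operation counts quoted above can be observed on a small field. The following is a minimal sketch of exponentiation by (14) using an addition chain in the style of Itoh and Tsujii; it is an illustration under assumed conventions (integers as bit vectors, `f` given as a bitmask including the $x^m$ term), not the exact procedure of [8] or [2]. The names `gf2m_mul` and `itoh_tsujii_inverse` are hypothetical.

```python
def gf2m_mul(a, b, m, f):
    """Multiply a and b in F_{2^m}; f is the bitmask of the field polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= f
    return r


def itoh_tsujii_inverse(a, m, f):
    """Return (a^(2^m - 2), multiplications, squarings) for nonzero a."""
    if a == 0:
        raise ZeroDivisionError("zero has no inverse")
    muls = sqrs = 0
    beta, k = a, 1                      # invariant: beta = a^(2^k - 1)
    for bit in bin(m - 1)[3:]:          # bits of m-1 after the leading one
        t = beta
        for _ in range(k):              # t = beta^(2^k)
            t = gf2m_mul(t, t, m, f)
            sqrs += 1
        beta = gf2m_mul(t, beta, m, f)  # beta = a^(2^(2k) - 1)
        muls += 1
        k *= 2
        if bit == "1":                  # beta = a^(2^(k+1) - 1)
            beta = gf2m_mul(gf2m_mul(beta, beta, m, f), a, m, f)
            sqrs += 1
            muls += 1
            k += 1
    inv = gf2m_mul(beta, beta, m, f)    # a^(2^m - 2)
    sqrs += 1
    return inv, muls, sqrs


# F_16 with f(x) = x^4 + x + 1: the inverse of x is x^3 + 1.
print(itoh_tsujii_inverse(0b0010, 4, 0b10011))  # prints (9, 2, 3)
```

For $m = 4$ the chain uses 2 multiplications and $m - 1 = 3$ squarings. Squarings are performed here with the general multiplier only for brevity; in hardware they would use the cheaper dedicated squarer discussed in Section 3.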

It has been shown in [7] that a normal basis multiplication can be performed with $2m^2 - 1$ bit operations when a type I optimal normal basis is used. If a type II optimal normal basis or a non-optimal normal basis is used, it takes at least $3m^2$ bit operations to accomplish a field multiplication [16,6]. On the other hand, if the polynomial basis generated with an irreducible trinomial is used, a bit-parallel multiplication needs at most $2m^2 - 1$ bit operations while a bit-parallel squaring costs not more than $m$ bit operations. In this case, the complexity of solving for the inverse using (14), in terms of bit operations with different bases, is given in the following table.



Table 1. The complexity (in bit operations) of inversion using the algorithms in [8].

| Basis | Multiplications | Squarings |
|---|---|---|
| Type I optimal NB | $(2m^2 - 1)\,(\log_2(m - 1) + H(m - 1) - 1)$ | - |
| Type II optimal NB | $\ge 3m^2\,(\log_2(m - 1) + H(m - 1) - 1)$ | - |
| Trinomial-generated PB | $(2m^2 - 1)\,(\log_2(m - 1) + H(m - 1) - 1)$ | $\le m(m - 1)$ |




It can be seen from the table that the complexity using a polynomial basis generated with an irreducible trinomial is comparable to that using a type I optimal normal basis, while it is much smaller than that using a type II optimal normal basis or a non-optimal basis. Moreover, for a given finite field $F_{2^m}$ it is more likely that an irreducible trinomial exists than that a type I optimal normal basis does. In fact, for $2 \le m \le 1000$, there is an irreducible trinomial in $F_{2^m}$ for 545 values of $m$, while there exists a type I optimal normal basis for only 67 values of $m$ [4,9].



5 Concluding Remarks 



In this article, we have shown that a bit-parallel multiplication operation in
$\mathbb{F}_{2^m}$ using polynomial basis can be performed in
$2m^2 + (r - 3)m - (r - 2)$ bit operations. We have also proven that a
bit-parallel squaring operation using polynomial basis costs not more than
$\lfloor (m + k - 1)/2 \rfloor$ bit operations if an irreducible trinomial
$x^m + x^k + 1$ over $\mathbb{F}_2$ is used. Consequently, it is argued that
computing the multiplicative inverse in $\mathbb{F}_{2^m}$ using polynomial
basis can be as good as using normal basis.

Subsequent work on implementation has shown that the resultant bit-parallel
multiplier and bit-parallel squarer also have low time delay.



¹ Assume that the Hamming weight of $m - 1$ is $\frac{1}{2} \log_2(m - 1)$ on average.
Abstract. Differential Power Analysis, first introduced by Kocher et
al. in [14], is a powerful technique for recovering secret smart card
information by monitoring power signals. In [14] a specific DPA attack
against smart-cards running the DES algorithm was described. As few as
1000 encryptions were sufficient to recover the secret key. In this paper
we generalize the DPA attack to elliptic curve (EC) cryptosystems and
describe a DPA on EC Diffie-Hellman key exchange and EC El-Gamal type
encryption. These attacks make it possible to recover the private key
stored inside the smart-card. Moreover, we suggest countermeasures that
thwart our attack.

Keywords. Elliptic curve, power consumption, Differential Power Analysis.



1 Introduction 

The use of elliptic curves in cryptography was first proposed by Miller [17] and
Koblitz [12] in 1985. Since that time, a lot of attention has been paid to elliptic
curves for cryptographic applications and it has become increasingly common to
implement public-key protocols on elliptic curves over large finite fields.
Elliptic curves (EC) provide a group structure, which can be used to translate
existing discrete-logarithm cryptosystems into the context of EC. The discrete
logarithm problem in a cyclic group G of order n with generator g refers to the
problem of finding x given some element $y = g^x$ of G. The discrete logarithm
problem over an EC seems to be much harder than in other groups such as the
multiplicative group of a finite field. No subexponential-time algorithm is known
for the discrete logarithm problem in the class of non-supersingular EC.
Consequently, keys can be much smaller in the EC context, typically about 160 bits.

In this paper we consider attacks based on monitoring the power consumption
of smart-card EC implementations. Differential Power Analysis, first
described by Kocher et al. in [14], is a powerful technique that exploits the
leakage of information related to power consumption. The attack was successfully
applied to a DES implementation; as few as 1000 encryptions were sufficient to
recover the secret key [14]. More recently, the resistance of smart-card
implementations of the AES candidates against power consumption monitoring was
considered in [1,3,5]. The conclusion was that straightforward implementations
of AES candidates were highly vulnerable to power analysis. In this paper we
show that naive implementations of ECC are also highly vulnerable to power
analysis.

The paper is organized as follows. After recalling the principle of EC
operations in section 2, we describe in section 3 the principle of our power
consumption attack. In section 4, we apply the attack to some common
discrete-logarithm based cryptosystems such as Diffie-Hellman key exchange [7]
and El-Gamal public-key encryption [8]. Finally we suggest three countermeasures
that prevent our attack.

2 Elliptic Curve Group Operation 

2.1 Definition of an Elliptic Curve 

An elliptic curve is the set of points (x, y) which are solutions of a bivariate
cubic equation over a field K (see [16]). An equation of the form:

    $y^2 + a_1 xy + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6$    (1)

where $a_i \in K$, defines an elliptic curve over K.

If char $K \neq 2$ and char $K \neq 3$, equation (1) can be transformed to:

    $y^2 = x^3 + ax + b$

with $a, b \in K$.

In the field $GF(2^n)$ of characteristic 2, equation (1) can be reduced to the
form:

    $y^2 + xy = x^3 + ax^2 + b$

with $a, b \in K$.

The set of points on an elliptic curve, together with a special point O called 
the point at infinity can be equipped with an Abelian group structure by the 
following addition operation : 

Addition formula [16] for char $K \neq 2, 3$:

Let $P = (x_1, y_1) \neq O$ be a point; the inverse of $P$ is $-P = (x_1, -y_1)$.
Let $Q = (x_2, y_2) \neq O$ be a second point with $Q \neq -P$; the sum
$P + Q = (x_3, y_3)$ can be calculated as:

    $x_3 = \lambda^2 - x_1 - x_2$
    $y_3 = \lambda(x_1 - x_3) - y_1$

with

    $\lambda = \frac{y_2 - y_1}{x_2 - x_1}$    if $P \neq Q$,
    $\lambda = \frac{3x_1^2 + a}{2y_1}$        if $P = Q$.

To subtract the point $P = (x, y)$, one adds the point $-P$.
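
The addition formula for char $K \neq 2, 3$ can be sketched in Python as follows. The toy curve $y^2 = x^3 + 2x + 2$ over GF(17), the base point (5, 1), and the names `add`, `neg` and `on_curve` are assumptions chosen for illustration; `pow(v, -1, p)` requires Python 3.8 or later.

```python
P_MOD, A, B = 17, 2, 2      # toy curve y^2 = x^3 + 2x + 2 over GF(17) (assumption)
O = None                    # the point at infinity


def on_curve(P):
    """Check that P satisfies the curve equation (O is always on the curve)."""
    if P is O:
        return True
    x, y = P
    return (y * y - (x ** 3 + A * x + B)) % P_MOD == 0


def neg(P):
    """Inverse of a point: -(x, y) = (x, -y)."""
    return O if P is O else (P[0], -P[1] % P_MOD)


def add(P, Q):
    """Affine point addition, covering O, Q = -P, doubling and the generic case."""
    if P is O:
        return Q
    if Q is O:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return O                                       # Q = -P
    if P == Q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)
```

For example, `add((5, 1), (5, 1))` evaluates the doubling branch with $\lambda = 13$ and yields the point (6, 3).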

Addition formula for char K = 2:

Let $P = (x_1, y_1) \neq O$ be a point; the inverse of $P$ is
$-P = (x_1, x_1 + y_1)$. Let $Q = (x_2, y_2) \neq O$ be a second point with
$Q \neq -P$; the sum $P + Q = (x_3, y_3)$ can be calculated as:

    $x_3 = \lambda^2 + \lambda + x_1 + x_2 + a$
    $y_3 = \lambda(x_1 + x_3) + x_3 + y_1$
    $\lambda = \frac{y_1 + y_2}{x_1 + x_2}$

if $P \neq Q$ and:

    $x_3 = \lambda^2 + \lambda + a$
    $y_3 = x_1^2 + (\lambda + 1) x_3$
    $\lambda = x_1 + \frac{y_1}{x_1}$

if $P = Q$.

2.2 Computing a Multiple of a Point 

The operation of adding a point P to itself d times is called scalar
multiplication by d and denoted dP. Scalar multiplication is the basic operation
for EC protocols. Scalar multiplication in the group of points of an elliptic
curve is the analogue of exponentiation in the multiplicative group of integers
modulo a fixed integer m.

Computing dP can be done with the straightforward double-and-add approach
based on the binary expansion of $d = (d_{\ell-1}, \ldots, d_0)$, where
$d_{\ell-1}$ is the most significant bit of d (the method is the analogue of the
square-and-multiply algorithm for exponentiation):

Algorithm 1 (Double-and-add)
input P
Q ← P
for i from ℓ-2 to 0 do
    Q ← 2Q
    if d_i = 1 then Q ← Q + P
output Q
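
Algorithm 1 can be sketched in Python as below. The toy curve $y^2 = x^3 + 2x + 2$ over GF(17) and the helper names are assumptions for illustration only; real implementations use fields of about 160 bits or more.

```python
P_MOD, A = 17, 2            # toy curve y^2 = x^3 + 2x + 2 over GF(17) (assumption)
O = None                    # the point at infinity


def add(P, Q):
    """Affine point addition for char K != 2, 3."""
    if P is O:
        return Q
    if Q is O:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return O
    if P == Q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)


def double_and_add(d, P):
    """Compute dP for d >= 1, scanning the bits of d from d_{l-2} down to d_0."""
    bits = bin(d)[2:]               # most significant bit first, bits[0] == "1"
    Q = P
    for bit in bits[1:]:
        Q = add(Q, Q)               # Q <- 2Q
        if bit == "1":
            Q = add(Q, P)           # Q <- Q + P
    return Q
```

Note that the point addition is executed only when the current bit is 1, which is exactly the data-dependent behaviour exploited in section 3.1.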



Various techniques exist to speed up scalar multiplication by reducing the
number of elementary point operations: see [9] for a good survey. If the point P
is known in advance, it may be advantageous to precompute a table of multiples
of P [2]. Because elliptic curve subtraction has the same cost as addition, the
previous double-and-add algorithm can be improved with the addition-subtraction
algorithm, which uses a signed binary expansion of d:

    $d = \sum_{i=0}^{\ell-1} c_i 2^i$

with $c_i \in \{-1, 0, 1\}$.

The non-adjacent form (NAF) of d is a signed binary expansion of d with
$c_i c_{i+1} = 0$ for all $i \geq 0$. Each positive integer has a unique NAF.
Moreover, the NAF of d has the fewest nonzero coefficients of any signed binary
expansion of d [9]. An algorithm that generates the NAF of any positive integer
is described in [18].
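
A minimal sketch of NAF generation follows. It uses the common "inspect d modulo 4" method, which is not necessarily the algorithm of [18]; the function name `naf` and the least-significant-first digit order are choices made for this example.

```python
def naf(d):
    """Return the NAF digits c_0, c_1, ... of a positive integer d (LSB first)."""
    digits = []
    while d > 0:
        if d & 1:
            c = 2 - (d % 4)     # +1 if d = 1 (mod 4), -1 if d = 3 (mod 4)
            d -= c              # d becomes divisible by 4, so the next digit is 0
        else:
            c = 0
        digits.append(c)
        d >>= 1
    return digits
```

For d = 7 this gives the digits [-1, 0, 0, 1], that is 7 = 8 - 1, with two nonzero digits instead of the three in the binary expansion 111.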

Algorithm 2 (Addition-subtraction method)
input P
Q ← P
for i from ℓ-2 to 0 do
    Q ← 2Q
    if c_i = 1 then Q ← Q + P
    if c_i = -1 then Q ← Q - P
output Q

The double-and-add method and addition-subtraction method can be gene- 
ralized to the m-ary method, the window method and the signed binary window 
method [9,15]. 

The problem of finding a method to compute dP with the fewest number of
elliptic curve group operations for a given d is equivalent to finding the
shortest addition-subtraction chain for d [9]. An addition chain [11] for d is a
sequence of positive integers:

    $a_0 = 1 \to a_1 \to a_2 \to \ldots \to a_r = d$

such that $a_i = a_j + a_k$, for some $k \leq j < i$, for all $i = 1, 2, \ldots, r$.

An addition chain can be extended to an addition-subtraction chain [11] with
$a_i = \pm a_j \pm a_k$ in place of $a_i = a_j + a_k$. The shortest
addition-subtraction chain for d gives the fewest number of elliptic group
operations for computing dP by computing $a_1 P, a_2 P, \ldots, a_r P = dP$.
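
The definition can be checked mechanically. The sketch below verifies whether a list of integers is an addition-subtraction chain for d, and builds the chain that double-and-add implicitly follows; both function names are hypothetical and serve only to illustrate the definition.

```python
def is_addition_subtraction_chain(chain, d):
    """Check a_0 = 1, a_r = d and a_i = +/- a_j +/- a_k for earlier j, k."""
    if not chain or chain[0] != 1 or chain[-1] != d:
        return False
    for i in range(1, len(chain)):
        earlier = chain[:i]
        if not any(chain[i] in (aj + ak, aj - ak, ak - aj)
                   for aj in earlier for ak in earlier):
            return False
    return True


def binary_chain(d):
    """Addition chain followed by the double-and-add method for d >= 1."""
    chain = [1]
    for bit in bin(d)[3:]:              # skip the prefix "0b" and the leading 1
        chain.append(2 * chain[-1])     # doubling
        if bit == "1":
            chain.append(chain[-1] + 1) # addition of P
    return chain
```

For d = 7, `binary_chain` returns [1, 2, 3, 6, 7], while [1, 2, 4, 8, 7] is a valid addition-subtraction chain of the same length that ends with a subtraction.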
3 Recovering d in Q = dP from the Power Consumption

In 1998, Kocher described in a technical draft [14] Simple Power Attacks (SPA) 
and Differential Power Analysis (DPA) on DES. A SPA consists in observing 
the power consumption of one single execution of a cryptographic algorithm. A 
DPA is more sophisticated and powerful. It consists in performing a statistical 
analysis of many executions of the same algorithm with different inputs. 

Here we show that monitoring power consumption during the computation
of Q = dP, knowing P, may make it possible to recover d. First we show that a
naive implementation of scalar multiplication may be vulnerable to SPA. However,
it is not difficult to make the implementation resistant against SPA. We then
describe a DPA attack on an implementation of scalar multiplication.

3.1 Resistance against SPA 

Power consumption attacks are based on the observation that the power consumed
at a given time during a cryptographic process is related to the instruction
being executed and the data being manipulated. Power consumption makes it
possible to visually identify large features, for example the main loop in
algorithm 1. Power consumption analysis may also make it possible to distinguish
between the instructions being executed. For example, it might be possible to
distinguish between point doubling and point addition in algorithm 1, thereby
revealing the bits of the exponent d.

In order to be resistant against SPA, the instructions performed during a 
cryptographic algorithm should not depend on the data being processed, e.g. 
there should not be any branch instructions conditioned by the data. It is easy 
to modify algorithm 1 to achieve this goal : 

Algorithm 1' (Double-and-add resistant against SPA)
input P
Q[0] ← P
for i from ℓ-2 to 0 do
    Q[0] ← 2Q[0]
    Q[1] ← Q[0] + P
    Q[0] ← Q[d_i]
output Q[0]
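
A Python sketch of algorithm 1' is given below, again on an assumed toy curve $y^2 = x^3 + 2x + 2$ over GF(17). It only illustrates the fixed sequence of point operations: Python itself gives no constant-time guarantee, and the list lookup `Q[bit]` is a stand-in for whatever data-independent selection a real device would use. The returned operation count is an addition made for this example.

```python
P_MOD, A = 17, 2            # toy curve y^2 = x^3 + 2x + 2 over GF(17) (assumption)
O = None                    # the point at infinity


def add(P, Q):
    """Affine point addition for char K != 2, 3."""
    if P is O:
        return Q
    if Q is O:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return O
    if P == Q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)


def double_and_add_always(d, P):
    """Return (dP, number of point operations) for d >= 1."""
    bits = [int(b) for b in bin(d)[2:]]
    Q = [P, O]
    ops = 0
    for bit in bits[1:]:
        Q[0] = add(Q[0], Q[0])      # always double
        Q[1] = add(Q[0], P)         # always add, even if the result is discarded
        Q[0] = Q[bit]               # keep 2Q or 2Q + P according to d_i
        ops += 2
    return Q[0], ops
```

The operation count depends only on the bit length of d: the scalars 16 and 31 both trigger eight point operations, whereas algorithm 1 would use four and eight respectively.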



3.2 DPA against Double-and-Add Algorithm 

In this section we describe a DPA against an implementation of algorithm 1’. 
We assume that the algorithm is performed in constant time. Otherwise the 
implementation may be subject to timing attack [13] and Simple Power Attacks 

[14]. 

DPA on DES [6] algorithm as described in [14] uses correlation between power 
consumption and specific key-dependent bits which appear at known steps of the 
encryption computation. For example, a selected bit b at the output of one SBOX
of the first round will depend on the known input message and 6 unknown bits of 
the key. In [14], the correlation between power consumption and b is computed 
for the 64 possible values of the 6 unknown bits of the key. The correlation is 
likely to be maximal for the correct guess of the 6 bits of the key. The attack 
can be repeated for the remaining SBOXes, thus revealing 48 bits of the key. 
The remaining 8 bits of the key can be recovered by exhaustive search. 

A Differential Power Analysis on algorithm 1' in section 3.1 can be performed
by noticing that at step j the processed point Q depends only on the first bits
$(d_{\ell-1}, \ldots, d_j)$ of d. Now assume that we know how points are
represented in memory during computation and select a particular bit (the same
for all points) of this representation. When point Q is processed, power
consumption will be correlated to this specific bit of Q. No correlation will be
observed with a point not computed inside the card. Thus it is possible to
successively recover the bits of the exponent by guessing which points are
computed by the card.

The second most significant bit $d_{\ell-2}$ of d can be recovered by computing
the correlation between power consumption and any specific bit of the binary
representation of 4P. If $d_{\ell-2} = 0$, 4P is computed during algorithm 1',
and power consumption is thus correlated with any specific bit of 4P. Otherwise,
if $d_{\ell-2} = 1$, 4P is never computed, and no correlation will be observed
with 4P. This gives $d_{\ell-2}$. The following bits of d can be recursively
recovered in the same way.

Assume that algorithm 1' is performed k times with distinct
$P_1, P_2, \ldots, P_k$ to compute $Q_1 = dP_1, Q_2 = dP_2, \ldots, Q_k = dP_k$.
Let $C_i(t)$ be the power consumption associated with the i-th execution of the
algorithm for $1 \leq i \leq k$. Let $s_i$ be any specific bit of the binary
representation of $4P_i$ for $1 \leq i \leq k$. The correlation function $g(t)$
between $s_i$ and $C_i(t)$ can be computed as follows:

    $g(t) = \langle C_i(t) \rangle_{i=1,2,\ldots,k \mid s_i=1} - \langle C_i(t) \rangle_{i=1,2,\ldots,k \mid s_i=0}$    (2)

Assume that the points $4P_i$ are processed at time $t = t_1$; power consumption
$C_i(t_1)$ will then be correlated with the specific bit $s_i$ of the binary
representation of $4P_i$. The average power consumption for those points $4P_i$
for which $s_i = 1$ will be different from the average power consumption for the
points $4P_i$ for which $s_i = 0$, and the function $g(t)$ will present a "peak"
at time $t = t_1$. If the points $4P_i$ are never computed, no "peak" will be
observed in the function $g(t)$. This is illustrated in figures 1 and 2.¹
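
The behaviour of $g(t)$ in equation (2) can be reproduced with a small simulation. Everything about the leakage below is an assumption made for illustration: the curve $y^2 = x^3 + 2x + 3$ over GF(10007), one power sample per processed point, a leak equal to the least significant bit of the x-coordinate, and unit Gaussian noise. It is a sketch of the statistical test, not a model of any real card.

```python
import random

P_MOD, A, B = 10007, 2, 3   # hypothetical toy curve y^2 = x^3 + 2x + 3 over GF(10007)
O = None                    # the point at infinity


def add(P, Q):
    """Affine point addition for char K != 2, 3."""
    if P is O:
        return Q
    if Q is O:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return O
    if P == Q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)


def random_point(rng):
    """Pick a random affine point; P_MOD = 3 (mod 4) gives an easy square root."""
    while True:
        x = rng.randrange(P_MOD)
        rhs = (x ** 3 + A * x + B) % P_MOD
        y = pow(rhs, (P_MOD + 1) // 4, P_MOD)
        if y and y * y % P_MOD == rhs:
            return (x, y)


def bit_of(X):
    """Selection bit s: least significant bit of the x-coordinate (arbitrary choice)."""
    return 0 if X is O else X[0] & 1


def trace(d, P, rng):
    """Simulated power samples C(t) of algorithm 1', one per processed point."""
    Q, samples = [P, O], []
    for b in [int(c) for c in bin(d)[3:]]:
        Q[0] = add(Q[0], Q[0])
        samples.append(bit_of(Q[0]) + rng.gauss(0, 1))
        Q[1] = add(Q[0], P)
        samples.append(bit_of(Q[1]) + rng.gauss(0, 1))
        Q[0] = Q[b]
    return samples


def g(traces, guesses):
    """Equation (2): mean trace where s_i = 1 minus mean trace where s_i = 0."""
    ones = [t for t, X in zip(traces, guesses) if bit_of(X) == 1]
    zeros = [t for t, X in zip(traces, guesses) if bit_of(X) == 0]
    mean = lambda rows, j: sum(r[j] for r in rows) / len(rows)
    return [mean(ones, j) - mean(zeros, j) for j in range(len(traces[0]))]


def peak_for_4P(d, k=2000, seed=1):
    """Largest |g(t)| when the attacker partitions the traces on a bit of 4P_i."""
    rng = random.Random(seed)
    points = [random_point(rng) for _ in range(k)]
    traces = [trace(d, P, rng) for P in points]
    guesses = [add(add(P, P), add(P, P)) for P in points]
    return max(abs(v) for v in g(traces, guesses))
```

With the secret d = 1011 in binary (so $d_{\ell-2} = 0$), `peak_for_4P` should be close to 1, the size of the simulated leak; with d = 1101 ($d_{\ell-2} = 1$) only noise-level values are expected.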



¹ Real power consumption curves were deliberately excluded from this paper to
avoid straightforward product identification.

Fig. 1. Simulated correlation function $g(t)$ between the points $4P_i$ and power
consumption $C_i(t)$ when $d_{\ell-2} = 0$. A peak is observed corresponding to
the computation of $4P_i$ inside the card.

Fig. 2. Simulated correlation function $g(t)$ between the points $4P_i$ and power
consumption $C_i(t)$ when $d_{\ell-2} = 1$. No peak is observed since the points
$4P_i$ are never computed by the card.

3.3 Extending the Attack to Any Scalar Multiplication Algorithm

In this section we show how to extend the previous attack to any scalar
multiplication algorithm executed in constant time with a constant
addition-subtraction chain, i.e. for any point P the algorithm computes the
sequence of points:

    $a_0 P \to a_1 P \to \ldots \to a_r P = dP$

such that $a_i = \pm a_j \pm a_k$, for some $k \leq j < i$, for all
$i = 1, 2, \ldots, r$.

The attack consists in successively guessing the $a_i$, starting from $a_0 = 1$
up to $a_r = d$. At step $i \geq 1$, one constructs the set $A_i$ of all possible
$a'_i = \pm a_j \pm a_k$ for all $0 \leq k \leq j < i$, and for each
$a'_i \in A_i$ computes the correlation function $g(t)$ between the point
$a'_i P$ and power consumption. If a peak can be observed in $g(t)$, this
indicates that the point $a'_i P$ has been computed by the device and thus
$a_i = a'_i$. This makes it possible to recover $d = a_r$ in time polynomial in
$r$, since step $i$ tests only $O(i^2)$ candidates.

4 Attacks on Elliptic Curve Public Key Protocols 

In this section we apply the attack to elliptic curve public key protocols such as
El-Gamal encryption and Diffie-Hellman key exchange. The attack does not apply
to ECDSA signatures, since in this case scalar multiplication is performed
with a random exponent instead of a fixed exponent.

4.1 Elliptic Curve Encryption Scheme 

This scheme is analogous to El-Gamal encryption [8]. 

System parameters:

    An elliptic curve $E$ over $GF(p)$ or $GF(2^n)$.
    The order of $E$, denoted $\#E$, must be divisible by a large prime q.
    A point $P \in E$ of order q.

Key generation:

    Secret key: $d \in_R [1, q - 1]$.
    Public key: $Q = dP$.

Encryption of a message m:

    Pick $k \in_R [1, q - 1]$.
    Compute the points $kP = (x_1, y_1)$ and $kQ = (x_2, y_2)$, and $c = x_2 + m$.
    The ciphertext is $(x_1, y_1, c)$.

Decryption:

    Compute $(x'_2, y'_2) = d(x_1, y_1)$ and $m = c - x'_2$.

The attack described before makes it possible to recover d when the device
decrypts the ciphertext $(x_1, y_1, c)$ for various points $(x_1, y_1)$.
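
The scheme can be exercised end to end on a toy curve. In the sketch below the curve $y^2 = x^3 + 2x + 2$ over GF(17), the base point (5, 1) of prime order 19, the message space of integers modulo 17, and all function names are assumptions for illustration; such parameters offer no security. Decryption is the step where the fixed secret d multiplies an attacker-chosen point.

```python
import random

P_MOD, A = 17, 2            # toy curve y^2 = x^3 + 2x + 2 over GF(17) (assumption)
O = None                    # the point at infinity
BASE = (5, 1)               # base point P of the scheme (assumption)


def add(P, Q):
    """Affine point addition for char K != 2, 3."""
    if P is O:
        return Q
    if Q is O:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return O
    if P == Q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)


def mult(d, P):
    """Scalar multiplication dP by double-and-add."""
    Q = O
    for bit in bin(d)[2:]:
        Q = add(Q, Q)
        if bit == "1":
            Q = add(Q, P)
    return Q


def order(P):
    """Order of P found by repeated addition (only feasible on toy curves)."""
    n, Q = 1, P
    while Q is not O:
        Q, n = add(Q, P), n + 1
    return n


Q_ORDER = order(BASE)       # the prime q of the scheme


def keygen(rng):
    d = rng.randrange(1, Q_ORDER)
    return d, mult(d, BASE)                 # secret d, public Q = dP


def encrypt(m, public, rng):
    k = rng.randrange(1, Q_ORDER)
    x1, y1 = mult(k, BASE)
    x2, _ = mult(k, public)
    return (x1, y1, (x2 + m) % P_MOD)       # ciphertext (x1, y1, c)


def decrypt(ciphertext, d):
    x1, y1, c = ciphertext
    x2, _ = mult(d, (x1, y1))               # scalar multiplication with the secret d
    return (c - x2) % P_MOD
```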

4.2 Elliptic Curve Diffie-Hellman Key Exchange

The EC Diffie-Hellman protocol derives a common secret value z from one
party's private key and another party's public key. The protocol is referenced as
ECSVDP-DH (Elliptic Curve Secret Value Derivation Primitive, Diffie-Hellman
version) in [10]. If the two parties correctly execute this primitive, they will
produce the same output.

System parameters:

    An elliptic curve $E$ over $GF(p)$ or $GF(2^n)$.
    The order of $E$, denoted $\#E$, must be divisible by a large prime q.
    Alice's own private key s.
    Bob's public key W.

Derivation of the shared secret value z:

    Compute the point $P = sW$.
    If $P = O$ output "error" and stop.
    The shared secret value is $z = x_P$, the x-coordinate of P.

The attack described in the previous section recovers Alice's secret key when
she computes the point $P = sW$ for Bob's public key W.

5 Countermeasures against DPA 

In this section we describe three countermeasures that prevent the attack
described in section 3. Recall that the attack makes it possible to recover d
when $Q_i = dP_i$ are computed inside the card for various $P_i$,
$1 \leq i \leq k$. These three countermeasures are based on introducing random
numbers during the computation of Q = dP. We stress that other attacks might of
course not be thwarted by our countermeasures.



5.1 First Countermeasure: Randomization of the Private Exponent 

Let $\#E$ be the number of points of the curve. The computation of Q = dP is
done by the following algorithm:

1. Select a random number k of size n bits. In practice, one can take n = 20 bits.

2. Compute $d' = d + k \cdot \#E$.

3. Compute the point $Q = d'P$. We have $Q = dP$ since $\#E \cdot P = O$.

This countermeasure makes the previous attack infeasible since the exponent
$d'$ in $Q = d'P$ changes at each new execution of the algorithm.
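
The three steps can be sketched as follows on an assumed toy curve $y^2 = x^3 + 2x + 2$ over GF(17). The number of points is obtained here by brute-force counting, which is only possible because the curve is tiny; the names `count_points` and `randomized_mult` are hypothetical.

```python
import random

P_MOD, A, B = 17, 2, 2      # toy curve y^2 = x^3 + 2x + 2 over GF(17) (assumption)
O = None                    # the point at infinity


def add(P, Q):
    """Affine point addition for char K != 2, 3."""
    if P is O:
        return Q
    if Q is O:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return O
    if P == Q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)


def mult(d, P):
    """Scalar multiplication dP by double-and-add."""
    Q = O
    for bit in bin(d)[2:]:
        Q = add(Q, Q)
        if bit == "1":
            Q = add(Q, P)
    return Q


def count_points():
    """#E by exhaustive search: all affine solutions plus the point at infinity."""
    n = 1
    for x in range(P_MOD):
        rhs = (x ** 3 + A * x + B) % P_MOD
        n += sum(1 for y in range(P_MOD) if y * y % P_MOD == rhs)
    return n


def randomized_mult(d, P, n_points, rng, n_bits=20):
    """Compute dP through the randomized exponent d' = d + k * #E."""
    k = rng.getrandbits(n_bits)         # step 1: random k of n bits
    d_prime = d + k * n_points          # step 2
    return mult(d_prime, P)             # step 3: equals dP since #E * P = O
```

Each call uses a different exponent $d'$, so the intermediate points that the attacker tries to predict change from one execution to the next.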

5.2 Second Countermeasure: Blinding the Point P 

The method is analogous to Chaum's blind signature scheme for RSA [4]. The
point P to be multiplied is "blinded" by adding a secret random point R for
which we know S = dR. Scalar multiplication is done by computing the point
d(R + P) and subtracting S = dR to get Q = dP. The points R and S = dR
can be initially stored inside the card and refreshed at each new execution by
computing $R \leftarrow (-1)^b 2R$ and $S \leftarrow (-1)^b 2S$, where b is a
random bit generated at each new execution. This makes the previous attack
infeasible since the point $P' = P + R$ to be multiplied by d is not known to
the attacker.
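
A sketch of the blinding countermeasure is shown below on an assumed toy curve $y^2 = x^3 + 2x + 2$ over GF(17). The class name `BlindedMultiplier` and its interface are hypothetical; the point is only to show that the stored pair (R, S) stays consistent after each refresh and that the result is still dP.

```python
import random

P_MOD, A = 17, 2            # toy curve y^2 = x^3 + 2x + 2 over GF(17) (assumption)
O = None                    # the point at infinity


def add(P, Q):
    """Affine point addition for char K != 2, 3."""
    if P is O:
        return Q
    if Q is O:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return O
    if P == Q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)


def neg(P):
    return O if P is O else (P[0], -P[1] % P_MOD)


def mult(d, P):
    """Scalar multiplication dP by double-and-add."""
    Q = O
    for bit in bin(d)[2:]:
        Q = add(Q, Q)
        if bit == "1":
            Q = add(Q, P)
    return Q


class BlindedMultiplier:
    """Holds the secret d and a blinding pair (R, S = dR)."""

    def __init__(self, d, R):
        self.d, self.R, self.S = d, R, mult(d, R)

    def multiply(self, P, rng):
        blinded = add(P, self.R)                        # P' = P + R
        Q = add(mult(self.d, blinded), neg(self.S))     # d(P + R) - S = dP
        b = rng.getrandbits(1)                          # refresh: R <- (-1)^b 2R
        R2, S2 = add(self.R, self.R), add(self.S, self.S)
        self.R, self.S = (neg(R2), neg(S2)) if b else (R2, S2)
        return Q
```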
5.3 Third Countermeasure: Randomized Projective Coordinates

Projective coordinates [16] can be used to avoid the costly field inversion for
point addition and doubling. The projective coordinates (X, Y, Z) of a point
P = (x, y) are given by:

    $x = \frac{X}{Z}, \quad y = \frac{Y}{Z}$

Another system of projective coordinates may be found in [10]. The projective
coordinates of a point are not unique because:

    $(X, Y, Z) = (\lambda X, \lambda Y, \lambda Z)$    (3)

for every $\lambda \neq 0$ in the finite field.

The third countermeasure consists in randomizing the projective coordinate
representation of a point P = (X, Y, Z). Before each new execution of the scalar
multiplication algorithm for computing Q = dP, the projective coordinates of P
are randomized according to equation (3) with a random $\lambda$. The
randomization can also occur after each point addition and doubling.

This makes the attack described above infeasible since it is not possible for 
the attacker to predict any specific bit of the binary representation of P in 
projective coordinates. 
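
Equation (3) can be illustrated with a few lines of Python over an assumed toy field GF(17); the function names are hypothetical and no projective point arithmetic is implemented, only the change of representation.

```python
import random

P_MOD = 17                  # toy prime field GF(17) (assumption)


def to_projective(point, rng):
    """Lift an affine point (x, y) to a random representative (lX, lY, lZ)."""
    x, y = point
    lam = rng.randrange(1, P_MOD)           # random nonzero lambda
    return (lam * x % P_MOD, lam * y % P_MOD, lam)


def rerandomize(triple, rng):
    """Apply equation (3) again with a fresh nonzero lambda."""
    lam = rng.randrange(1, P_MOD)
    return tuple(lam * c % P_MOD for c in triple)


def to_affine(triple):
    """Recover (x, y) = (X/Z, Y/Z); requires Z != 0."""
    X, Y, Z = triple
    z_inv = pow(Z, -1, P_MOD)
    return (X * z_inv % P_MOD, Y * z_inv % P_MOD)
```

The bits stored in memory change with every `rerandomize` call, while `to_affine` always returns the same point.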

6 Conclusion 

We have shown that, unless protected, implementations of elliptic curve
cryptosystems such as El-Gamal type encryption or Diffie-Hellman key exchange
are vulnerable to Differential Power Analysis. We have introduced three
countermeasures that specifically address these attacks. These countermeasures
are easy to implement and do not impact efficiency in a significant way.
However, we do not claim that these countermeasures thwart all kinds of power
attacks, since it may be possible to exploit the information leakage through
power consumption in a different way.
Abstract. This paper describes a new type of attack on tamper-resistant
cryptographic hardware. We show that by locally observing the
value of a few RAM or address bus bits (possibly a single one) during the
execution of a cryptographic algorithm, typically by means of a probe
(needle), an attacker could easily recover information on the secret key
being used; our attacks apply to public-key cryptosystems such as RSA
or El Gamal, as well as to secret-key encryption schemes including DES
and RC5.



1 Introduction 

In recent years, many researchers have started investigating the security
of tamper-resistant devices such as smart-cards. Along with many other
cryptanalytic attacks on cryptographic algorithms, new attacks have been
suggested. These attacks usually assume the existence of some kind of
side-channel and retrieve secret information on the process being executed
aboard the device [1].

We distinguish between two types of side-channel attacks, namely 
passive or intrusive attacks. Typical examples of the first kind are timing 
attacks [10,7] and power attacks [11], in which the execution time or the 
power consumption are monitored while secrets are being handled by the 
device. Other examples are side-channel attacks described by Schneier et 
al. which include (for instance) carry bit analysis [17]. 

On the other hand, some authors described aggressive scenarios which
consist in influencing or perturbing the behavior of the device in order to
infer the secret. These attacks include Boneh et al.'s induction of transient
faults during RSA computations [3] or even cutting wires and forcing
given bit values, such as in Differential Fault Analysis [2,14] of DES.

In this paper we consider a new kind of passive attack, which appears
to be more powerful than the previous ones. No statistical analysis
is needed in most cases. We suppose that the attacker simply has access
to a probe station, which (for the non-specialist) is a kind of needle that
allows one to monitor the value of a single bit during the execution of some
cryptographic algorithm. Such devices are, of course, not self-sufficient.
In practice, the attacker must first prepare the surface of the chip in a
specific way and overcome a long list of technical problems such as line
width, protective layers, the lack of information allowing one to match a
physical chip location with a given gate, security detectors, and other purely
physical phenomena such as the needle's electrical loading or pure signal
synchronization problems. Moreover, depending on whether it is known
to which register the bit belongs, probing may present a wide range of
degrees of difficulty. Part of the analysis consists of guessing which bit is
being recorded and once this is done, inferring the secret key or private
exponent becomes easy.

© Springer-Verlag Berlin Heidelberg 1999

H. Handschuh, P. Paillier, and J. Stern

The paper is organised as follows. The next two sections investigate
probing attacks on RSA and DSA-like cryptosystems. Section 4 and 5
focus on applying specific probing-based cryptanalysis on secret key
encryption schemes such as DES and RC5.

2 Probing Attacks in Public-Key Cryptography 

Most public-key cryptosystems require modular exponentiations. Unless
specifically addressed by a dedicated hardware design, the modular
exponentiation is usually available aboard a cryptographic device as a software
implementation. Although many variants exist, most real-life devices
implement the well-known Square-and-Multiply algorithm in its nominal
version.

This section is intended to introduce a new (probing-based) cryptanalytic
attack that completely recovers the exponent of a typical
Square-and-Multiply implementation, thus providing a tool for breaking RSA,
El Gamal, DSA, Schnorr-type signature schemes, and so forth. We first
introduce and comment on our adversarial model.

2.1 Our Attack Model 

We will denote by SM-1 the standard Square-and-Multiply algorithm that
outputs $m^d \bmod n$, given a base m, an exponent d and a modulus n.
During the modular exponentiation, d is scanned bit by bit from left to
right and modular multiplications or squarings are successively applied to
an accumulator A depending on the current exponent bit. Denoting by
$|d|$ the bitlength of d, we recall the procedure in Fig. 1.

Initialization
    N ← n, M ← m
    A ← 1, i ← |d|
Scanning Loop
    While (1 ≤ i)
    {
        A ← A · A mod N
        If (d[i] == 1), A ← A · M mod N
        i ← i - 1
    }
Output A = m^d mod n

Fig. 1. Standard Square-and-Multiply Procedure (SM-1)

In the sections that follow, we will be considering an attack scenario
in which the adversary is given access to some bits of the accumulator
throughout the exponentiation and attempts to recover information about
the exponent d. Before going further, we state that any attack in the above
model could be refined or generalized in various ways to support
known variants of SM-1, such as right-to-left or multiple bit scanning.
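
For reference, SM-1 can be written in a few lines of Python. The optional `history` argument is an addition made for this sketch: it records the accumulator after each loop iteration, which is the quantity the attack model assumes to be partially observable.

```python
def sm1(m, d, n, history=None):
    """Left-to-right square-and-multiply: return m^d mod n for d >= 1."""
    A = 1
    for bit in bin(d)[2:]:          # scan d from its most significant bit
        A = A * A % n               # squaring
        if bit == "1":
            A = A * m % n           # multiplication, only when the bit is 1
        if history is not None:
            history.append(A)       # accumulator value after this iteration
    return A
```

After the i-th iteration the recorded value equals `pow(m, d >> (len(bin(d)) - 2 - i), n)`, that is m raised to the integer formed by the i leading bits of d.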

To be more precise, our model assumes that some "monitoring oracle"
provides the adversary with the value of certain accumulator bits supposedly
updated at each execution of the internal loop of SM-1. This means
that bit-values are collected by the attacker just after the accumulator
was squared or squared-then-multiplied. Therefore, we implicitly consider
that the monitoring oracle is capable of synchronizing perfectly its
observations with the actual execution of SM-1 aboard the device.



2.2 Probing Attack of SM-1 

In this section, we show how to infer d by probing a single computation
of the form $m^d \bmod n$ for given m and n when it is known that the
exponentiation is done by SM-1. We will first consider the case when a few bits
(possibly a single one) are probed at known positions
$J \subseteq \{1, \ldots, |A|\}$ in the accumulator. We denote by A(J) the set of
bits appearing at positions belonging to J in A. Section 2.3 extends the attack
to the case when J is unknown.

Let $d_i$ be the integer formed by the i leading bits of d, for
$i = 1, \ldots, |d|$. Clearly, after the i-th step of SM-1 was executed, the
accumulator contains the value $m^{d_i} \bmod n$. Meanwhile, the monitoring
oracle has provided the attacker with the sequence

    $T_i = (A^1(J), A^2(J), \ldots, A^i(J))$ .    (1)

Now if $\delta$ is a guess for $d_i$, the attacker can easily simulate SM-1 given
$(m, n, \delta)$ and collect the bits at positions J in the simulated accumulator,
i.e. the sequence

    $T'_i(\delta) = (A'^1(J), A'^2(J), \ldots, A'^i(J))$ .    (2)

It is now clear that $\delta$ is a correct guess only if $T_i = T'_i(\delta)$.
Then, the procedure can be iterated at step i + 1 by relying on the surviving
guesses at step i. This attack strategy is summarized in Fig. 2.



For (i = 1, ..., |d|)
{
    (1) Δ_0 ← {2δ, 2δ + 1 | δ ∈ Δ}
    (2) Δ ← {δ | δ ∈ Δ_0 and T_i = T'_i(δ)}
}
return Δ



Fig. 2. The Adversary’s Strategy 
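
The strategy of Fig. 2 can be simulated as follows. The sketch keeps, for each surviving guess, the simulated accumulator value, so that only the newest observation has to be compared at each step; this is equivalent to comparing whole traces because every survivor already matched the earlier observations. The function names, the 0-based bit positions and the starting set {0} are assumptions of this example.

```python
def observe(m, d, n, J):
    """Monitoring oracle: bits of the accumulator at positions J after each step."""
    trace, A = [], 1
    for bit in bin(d)[2:]:
        A = A * A % n
        if bit == "1":
            A = A * m % n
        trace.append(tuple((A >> j) & 1 for j in J))
    return trace


def recover(m, n, J, T):
    """Return every exponent guess whose simulated trace equals the observed T."""
    survivors = [(0, 1)]                        # pairs (guess delta, accumulator)
    for observed in T:
        extended = []
        for delta, A in survivors:
            squared = A * A % n
            # stage (1): extend delta by one bit; stage (2): keep matching guesses
            for bit, value in ((0, squared), (1, squared * m % n)):
                if tuple((value >> j) & 1 for j in J) == observed:
                    extended.append((2 * delta + bit, value))
        survivors = extended
    return [delta for delta, _ in survivors]
```

A typical use is `recover(m, n, [5], observe(m, d, n, [5]))`, which returns a list that contains the true exponent d, possibly together with a few wrong guesses that happen to produce the same observed bits.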



Since $T_i = T'_i(d_i)$ for all i, it is clear that when the attack ends
$\Delta$ contains at least $d = d_{|d|}$. The point here resides in detecting
whether the number of guesses to be checked is likely to explode or not while
carrying out the attack. Although it seems hard to answer this question
in the general case, one can still address it by means of a heuristic
reasoning. Let us first consider stage (2) during which wrong guesses are
eliminated from $\Delta_0$. At step i, each element $\delta$ of $\Delta_0$
entering (2) fulfills $T_{i-1} = T'_{i-1}(\delta_{i-1})$ where
$\delta_{i-1} = \lfloor \delta/2 \rfloor$. Furthermore, $\delta \in \Delta_0$
passes the test if (and only if) the equality $A^i(J) = A'^i(J)$ holds. This
happens with probability:

    $p(\delta) = P[A^i(J) = A'^i(J) \mid A^{i-1}(J) = A'^{i-1}(J)]$ .

By heuristically assuming that bits $A^i(J)$ and $A'^i(J)$ are decorrelated¹
for any wrong guess, we get that $p(\delta)$ is nearly $\varepsilon = 2^{-|J|}$.
Obviously, the correct exponent yields $p(d_i) = 1$. Since stage (1) just doubles
the number of elements in $\Delta$, one has, denoting by $u_i$ the average number
of surviving guesses after step i, that:

    $u_i = \sum_{\delta \in \Delta_0} p(\delta) = p(d_i) + \sum_{\delta \in \Delta_0 \setminus d_i} p(\delta) = 1 + \varepsilon(2u_{i-1} - 1)$ .    (3)

Consequently, we get:

    $u_i = \frac{1 - \varepsilon - \varepsilon(2\varepsilon)^i}{1 - 2\varepsilon} \leq \frac{1 - \varepsilon}{1 - 2\varepsilon}$

when $2\varepsilon < 1$ and $u_i = 1 + \frac{i}{2}$ if $\varepsilon = 1/2$. This
yields:

    $E(|\Delta|) \leq \frac{1 - \varepsilon}{1 - 2\varepsilon}$

throughout the attack. If J has a single element, that is if a single bit is
observed by the monitoring oracle, then $\varepsilon = \frac{1}{2}$ and

    $E(|\Delta|) \leq 1 + \frac{|d|}{2}$ ,

which proves that the attack has a low heuristic complexity.

¹ In theory, these bits are somehow correlated, but the modular multiplication
offers such excellent diffusion properties that our assumption remains highly
realistic.
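
The closed form can be checked numerically against recurrence (3). The short Python sketch below assumes the starting value $u_0 = 1$ (a single initial guess); the function names are hypothetical.

```python
def u_recurrence(i, eps):
    """Iterate u_i = 1 + eps * (2 * u_{i-1} - 1) starting from u_0 = 1."""
    u = 1.0
    for _ in range(i):
        u = 1 + eps * (2 * u - 1)
    return u


def u_closed_form(i, eps):
    """Closed form of the recurrence, with the special case eps = 1/2."""
    if eps == 0.5:
        return 1 + i / 2
    return (1 - eps - eps * (2 * eps) ** i) / (1 - 2 * eps)
```

For a single probed bit (eps = 1/2) the expected number of surviving guesses grows only linearly with the step index, and for two or more probed bits it stays below $(1 - \varepsilon)/(1 - 2\varepsilon)$.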



2.3 Random Hit Attacks 

We now address the case when the attacker does not actually know the
position of the monitored bits. One could of course address the problem
by executing the above strategy with all possible positions simultaneously
(there are $|A|^{|J|}$ such possibilities). Instead, we provide a much faster
strategy derived from the previously discussed one. The idea exploited here is
that the adversary can guess the position of these bits while performing
the attack. Again, at step i, $\delta$ is a correct guess for $d_i$ if (and only
if) $T_i$ coincides with some "simulated trace" that the attacker produces, for
instance the complete accumulator history

    $T'_i = (A'^1, \ldots, A'^i)$ .

Let us define $T'_i(j; \delta)$ as the sequence $(A'^1(j), \ldots, A'^i(j))$.
From now on, if $T_i \not\subset T'_i(j; \delta)$ holds for some
$j \in \{1, \ldots, |A|\}$ then either $\delta$ is a wrong guess for $d_i$ or
$j \notin J$. This motivates the attack strategy depicted in Fig. 3.

In this setting, one can show that:

    $E(|\Delta|) \leq \frac{1 - \varepsilon}{1 - 2\varepsilon} + |A|(2\varepsilon)^{|d|}$  with $\varepsilon = 2^{-|J|}$,    (4)

which means that the attack would succeed again.
J ← {1, ..., |A|}
∀j ∈ J, Δ_j ← {0}
For (i = 1, ..., |d|)
{
    For (j ∈ J)
    {
        Δ_j ← {2δ, 2δ + 1 | δ ∈ Δ_j}
        Δ_j ← {δ | δ ∈ Δ_j and T_i ⊂ T'_i(j; δ)}
    }
    If Δ_j = ∅ then J ← J - j
}
return Δ = ∪_j Δ_j





Fig. 3. Random Hit Attack 



2.4 Discrete-Log Based Signatures 

Several Discrete Log (DL)-based cryptosystems have been proposed in the
literature. In virtually all of them (El Gamal [5], Schnorr [23], DSA [4]),
a fixed known base g has to be raised to a random power k modulo some
known prime p. The security of the cryptosystem relies on the randomness
and the secrecy of the exponent k, which plays in this context a role similar
to that of a secret key.

From the above two attacks, it turns out that probing a single bit during
a single signature generation suffices to recover the random exponent
k. The device's secret key can thus be easily inferred from the knowledge
of k and the output signature.
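
Taking DSA as an example, the last step can be sketched as follows: from the signing equation $s = k^{-1}(h + xr) \bmod q$, a known k gives $x = (sk - h) r^{-1} \bmod q$. The toy parameters p = 23, q = 11, g = 4 and the function names are assumptions for illustration; h stands for the hash of the signed message, and other DL-based schemes use a different signing equation.

```python
def dsa_sign(h, x, k, p, q, g):
    """Textbook DSA signature (r, s) on hash value h with secret x and nonce k."""
    r = pow(g, k, p) % q
    s = pow(k, -1, q) * (h + x * r) % q
    return r, s


def recover_secret_key(h, r, s, k, q):
    """Solve s = k^-1 (h + x r) mod q for x once the nonce k is known."""
    return (s * k - h) * pow(r, -1, q) % q
```

With p = 23, q = 11, g = 4, secret key x = 5, nonce k = 3 and hash value h = 8, the signature is (r, s) = (7, 7), and `recover_secret_key(8, 7, 7, 3, 11)` returns 5.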



3 Probing Attacks on DES 

Following Biham and Shamir's work [2] on Differential Fault Analysis
of Secret Key Cryptosystems, as well as Schneier et al.'s ideas on
side-channel attacks [17], we take a closer look at probing secret key
algorithms, which may be considered as yet another side-channel where
information leaks. As we saw before, probing is considered a passive attack,
whereas cutting wires would be an aggressive attack.

The goal of this section is therefore to show that even a passive
attacker may retrieve the secret key of a DES implementation given one
single bit of information at each round.
3.1 The Information Leakage Model

The attack we subsequently present works in the following context. Sup- 
pose an attacker uses an electronic station to locally observe the value 
of a given bit during the execution of DES. We require that the attacker 
have sufficient knowledge of the device to be able to recognize two spe- 
cific registers which are the R and L data registers on which the round 
function of an iterated Feistel [6] cipher applies. In the case of DES, any 
bit of one or the other register is enough to attack the first and the last 
round subkeys. 

3.2 Attacking DES 

DES [13] is a 16-round Feistel scheme which can be described as follows:
Let m be a 64-bit message divided into two 32-bit halves: the left half
$m_L$ and the right half $m_R$; let c be the corresponding ciphertext and
$k_i$ the i-th round 48-bit subkey. Finally, let IP and $IP^{-1}$ denote
the initial and final permutations and F the round function of DES. The
algorithm is briefly depicted in Fig. 4.



    $(L_0 | R_0) = IP(m_L | m_R)$
    for i = 1 to 16:
        $L_i = R_{i-1}$;
        $R_i = F(R_{i-1}, k_i) \oplus L_{i-1}$;
    end for
    $c = IP^{-1}(R_{16} | L_{16})$



Fig. 4. Brief Description of DES



In this attack we ignore the initial and final permutations, as these are
public anyway. Suppose the plaintext is simply $(L_0 | R_0)$ and the
ciphertext $c = (L_{16} | R_{16})$. We shall now explain how to recover
6 bits of the last round subkey. Assume that the probe station enables us
to record the value of bit number b of register L at each round. The
F-function in DES is such that each output bit can be related to a given
S-box having a 6-bit input. We refer the reader to [13] for more
information on the structure of the F-function. As $L_{16} = R_{15}$, for
any ciphertext we can select the 6 bits entering the F-function that
produce bit b as an output. These six bits are XORed with six bits of the
secret subkey $k_{16}$ before entering a specific S-box. Therefore, for
each possible value of the 6-bit secret key entering the S-box, the
attacker can compute the expected output bit b* of the F-function. Notice
that she also has access to the real value of this bit, since
$R_{16} = F(R_{15}, k_{16}) \oplus L_{15}$ and she knows bit b of $L_{15}$
from the probe as well as the corresponding bit of the ciphertext half
$R_{16}$. Therefore, with one ciphertext, the attacker can eliminate those
key guesses for which the expected value b* and the real value of bit b
of the output of the round function are not the same. On average, half of
the key candidates will survive. Thus 6 bits of the key are recovered
with about 6 different ciphertexts.
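
The elimination step can be sketched on a single S-box. This is a toy
model under stated assumptions, not a full DES attack: it uses only the
DES S-box S1, omits the expansion and the bit permutation of the
F-function, and simulates the observed bit directly. In the real attack
the observed bit would be computed as bit b of $R_{16} \oplus L_{15}$.
The secret 6-bit chunk, the output bit index and the function names are
hypothetical.

```python
import random

# DES S-box S1, as published in the DES specification.
S1 = [
    [14, 4, 13, 1, 2, 15, 11, 8, 3, 10, 6, 12, 5, 9, 0, 7],
    [0, 15, 7, 4, 14, 2, 13, 1, 10, 6, 12, 11, 9, 5, 3, 8],
    [4, 1, 14, 8, 13, 6, 2, 11, 15, 12, 9, 7, 3, 10, 5, 0],
    [15, 12, 8, 2, 4, 9, 1, 7, 5, 11, 3, 14, 10, 0, 6, 13],
]


def sbox_bit(v, j):
    """Bit j of S1 applied to the 6-bit value v (outer bits pick the row)."""
    row = ((v >> 4) & 2) | (v & 1)
    col = (v >> 1) & 0xF
    return (S1[row][col] >> j) & 1


def surviving_candidates(observations, j):
    """Keep the 6-bit key guesses consistent with every observed bit."""
    return [k for k in range(64)
            if all(sbox_bit(x ^ k, j) == bit for x, bit in observations)]


rng = random.Random(0)
secret_chunk = 0b101101      # hypothetical 6 bits of the last subkey
j = 2                        # hypothetical S-box output bit being observed
observations = []
for _ in range(12):
    x = rng.randrange(64)    # the 6 known bits entering the S-box
    observations.append((x, sbox_bit(x ^ secret_chunk, j)))

survivors = surviving_candidates(observations, j)
```

Each observation removes about half of the remaining wrong guesses, and
the true chunk always stays in `survivors`.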

The same attack can be carried out on six bits of the secret key of the
first round. DES happens to be designed such that these 6 bits are
different from the 6 bits recovered from the last round subkey. As a
matter of fact, the initial permutation on the secret key results in no
two consecutive bits entering the same S-box in both the first round and
the last round subkey. Therefore a total of 12 bits can be recovered by
this attack. The remaining 44 bits of the key can be found by exhaustive
search.



3.3 Discussion 

Note that the attack works on both registers R and L. As a matter of
fact, if the attacker probes register R, we can apply exactly the same
attack as before because the iterated Feistel structure guarantees that
$R_{14} = L_{15}$. Thus we can still compute the input and expected
output of the round function and compare the latter to the real value
derived from $R_{14}$ as well as from the ciphertext.

This attack uses the same principle as the one described by Biham and
Shamir, but does not require the attacker to be able to cut wires or
induce faults. In our setting, the prober simply observes the value of a
given bit throughout the execution of the block cipher. The complexity is
very low: only a handful of plaintext/ciphertext pairs are needed, and
the number of offline encryptions is $2^{44}$.

Finally, we note that the attack cannot be carried out on two-key
triple-DES or triple modes of encryption using DES as a building block.
This comes from the fact that only 12 bits (6 from the first encryption
component and 6 from the last encryption component) of one of the secret
keys can be recovered; therefore the overall exhaustive search on the
remaining key bits still amounts to $2^{100}$ offline encryptions.
Additionally, no output bit of the round function depends on the input
bit at the same bit position, due to the bit permutation after the S-box
layer. Thus the intermediate values of bit b are of no use to the
attacker if she cannot get any other information.



4 Probing RC5 

As another example, let us consider RC5. Other iterated algorithms such
as RC6 are equally vulnerable to probing. RC5 has been extensively
studied from a regular cryptanalysis point of view; see for example
[9,8,18].

4.1 Description of RC5 

RC5 is an iterative secret-key block cipher designed by R. Rivest [15].
It has variable parameters such as the key size, the block size and the
number of rounds. A particular instance of the RC5 algorithm is denoted
by RC5-w/r/b, where w is the word size (a block is made of two words), r
is the number of rounds and b the number of secret key bytes. Our attack
works for every choice of these parameters.

RC5 works as follows: the secret key is first extended into a table of
2r + 2 secret w-bit words $S_i$. We will assume that RC5's key schedule is
one-way and focus on recovering the extended secret key table and not
the secret key itself. The detailed description of the key schedule can
be found in [15]. Letting $(L_0, R_0)$ denote the left and right halves
of the plaintext, the encryption algorithm is depicted in Fig. 5.



    $L_1 = L_0 + S_0$
    $R_1 = R_0 + S_1$
    for i = 2 to 2r + 1 do
        $L_i = R_{i-1}$
        $R_i = ((L_{i-1} \oplus R_{i-1}) \lll R_{i-1}) + S_i$



Fig. 5. Brief Description of RC5



The ciphertext is $(L_{2r+1}, R_{2r+1})$. The transformation performed
for a given i value is called a half-round: there are 2r + 2 half rounds.
Each half-round involves exactly one subkey $S_i$. All additions are mod
$2^w$ and the rotations are mod w. As usual, $\oplus$ denotes a bitwise
exclusive or.
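
The half-round description of Fig. 5 translates directly into code. The
following is a minimal sketch, assuming w = 32 and an already expanded
key table S (the key schedule of [15] is not reproduced here); the
function names are illustrative. The decryption routine simply undoes
the half-rounds in reverse order, which is also what an attacker does
after recovering a subkey.

```python
W = 32                       # word size w in bits (assumed)
MASK = (1 << W) - 1


def rotl(x, s):
    """Rotate the w-bit word x left by s mod w positions."""
    s %= W
    return ((x << s) | (x >> (W - s))) & MASK


def rotr(x, s):
    """Rotate the w-bit word x right by s mod w positions."""
    s %= W
    return ((x >> s) | (x << (W - s))) & MASK


def rc5_encrypt(L0, R0, S):
    """Encrypt (L0, R0) with the expanded key table S of 2r + 2 words."""
    L, R = (L0 + S[0]) & MASK, (R0 + S[1]) & MASK
    for i in range(2, len(S)):
        # L_i = R_{i-1};  R_i = ((L_{i-1} xor R_{i-1}) <<< R_{i-1}) + S_i
        L, R = R, (rotl(L ^ R, R) + S[i]) & MASK
    return L, R


def rc5_decrypt(L, R, S):
    """Invert rc5_encrypt by peeling off the half-rounds one by one."""
    for i in range(len(S) - 1, 1, -1):
        prev_R = L
        prev_L = rotr((R - S[i]) & MASK, prev_R) ^ prev_R
        L, R = prev_L, prev_R
    return (L - S[0]) & MASK, (R - S[1]) & MASK


# Hypothetical expanded key for r = 12 rounds (not from a real key schedule).
S = [(0x9E3779B9 * (i + 1)) & MASK for i in range(2 * 12 + 2)]
plaintext = (0x01234567, 0x89ABCDEF)
ciphertext = rc5_encrypt(*plaintext, S)
```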

4.2 Probing Attack on RC5

For our purposes, we suppose that the attacker once again has access to
all intermediate values of some bit b of either register L or register R,
which is the case when she can probe one of these two registers in an
iterated hardware implementation of the algorithm or a specific RAM
buffer. We start by describing an attack where the adversary probes
register R. Our technique uses some of the ideas presented in [7] to
derive the subkeys one by one in reverse order, starting with
$S_{2r+1}$.

Step 1.

First, collect a few multiples of w plaintext/ciphertext pairs and sort
them by the value of the $\log_2 w$ least significant bits of the left
half of the ciphertext, $L_{2r+1}$. There should be at least a few texts
available in each such "category". Then consider the texts which belong
to category $(w - b) \bmod w$ (i.e. the value of the $\log_2 w$ least
significant bits of $L_{2r+1}$ is $(w - b) \bmod w$). We probe the value
of bit b of register $R_{2r-1} = L_{2r}$. Since we know the value of bit
b of $R_{2r} = L_{2r+1}$ as well as the value of the last rotation from
the left ciphertext half, we can compute the least significant bit of the
rotated word just before the last subkey addition. Therefore the least
significant bit of the last subkey can be found.

Step 2.

Next, consider category $(w - b + 1) \bmod w$. After the last rotation,
bit b of $L_{2r} \oplus R_{2r}$ will be in position 1. Applying the same
method as in Step 1, we can derive bit 1 of the last round subkey (taking
into account the carry bit created by the addition of the least
significant bit, which is by now already known to the attacker), and so
on for the remaining (w - 2) bits of the subkey. They are derived one by
one using different ciphertexts, from the low order end to the high order
end of the subkey.

Step 3.

After recovering the last subkey, decrypt one half-round, sort the
ciphertexts according to the new value of the $\log_2 w$ least
significant bits of $L_{2r}$ and derive $S_{2r}$. Derive all subsequent
subkeys down to the very first four.

Step 4. The first four subkeys can be found by cryptanalysing a
two-round block cipher, which is straightforward. This successfully
concludes the probing attack on RC5.
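
Steps 1 and 2 can be simulated. The sketch below is an illustration
under stated assumptions rather than the authors' code: it assumes
w = 32, a random expanded key table, and a probe that returns bit b of
$L_{2r}$ for every encryption. The carry into bit j is obtained by
comparing the low j bits of the ciphertext word with the low j bits of
the subkey recovered so far. All names and parameters are hypothetical.

```python
import random

W = 32
MASK = (1 << W) - 1


def rotl(x, s):
    s %= W
    return ((x << s) | (x >> (W - s))) & MASK


def encrypt_with_probe(L0, R0, S, b):
    """Encrypt per Fig. 5 and also return bit b of L_{2r} (the probe)."""
    L, R = (L0 + S[0]) & MASK, (R0 + S[1]) & MASK
    probed = 0
    for i in range(2, len(S)):
        if i == len(S) - 1:
            probed = (L >> b) & 1        # bit b of L_{2r} = R_{2r-1}
        L, R = R, (rotl(L ^ R, R) + S[i]) & MASK
    return L, R, probed


def recover_last_subkey(samples, b):
    """Recover S_{2r+1} bit by bit from (L_c, R_c, probed bit) triples."""
    by_rotation = {}
    for Lc, Rc, p in samples:
        by_rotation.setdefault(Lc % W, (Lc, Rc, p))   # one text per category
    subkey = 0
    for j in range(W):
        # The category in which bit b is rotated into position j.
        Lc, Rc, p = by_rotation[(j - b) % W]
        x_j = p ^ ((Lc >> b) & 1)        # bit j of the word before the addition
        low = (1 << j) - 1
        carry = 1 if (Rc & low) < (subkey & low) else 0
        subkey |= (((Rc >> j) & 1) ^ x_j ^ carry) << j
    return subkey


rng = random.Random(1999)
r, b = 4, 5                               # hypothetical rounds and probed bit
S = [rng.getrandbits(W) for _ in range(2 * r + 2)]
samples, seen = [], set()
while len(seen) < W:                      # collect until every category occurs
    sample = encrypt_with_probe(rng.getrandbits(W), rng.getrandbits(W), S, b)
    samples.append(sample)
    seen.add(sample[0] % W)

recovered = recover_last_subkey(samples, b)
```

In this simulation `recovered` equals the last entry of `S`; repeating
the procedure after peeling off one half-round corresponds to Step 3.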

4.3 Discussion

The attack works in a similar way when we probe register L directly. So
the knowledge of a single intermediate bit at each round makes it
possible to derive the complete extended secret key and thus to further
recover the initial secret key. On average, a few multiples of w known
plaintext/ciphertext pairs are needed, in order to be sure that at each
round at least one text corresponding to a given rotation value is
available. If this should not be the case, the attacker can still query
some more pairs of texts while she is working backwards towards the first
rounds of the cipher. The complexity of this attack is actually very low
and is less than that of an exhaustive search for a single 32-bit subkey.
Depending on the case, either $S_0$, $S_2$ or $S_1$, $S_3$ have to be
determined otherwise than by the above attack. They can either be
recovered from the key schedule, or by guessing a few bits of $S_2$ or
$S_3$ at a time and checking for consistency on the corresponding bits of
$S_0$ or $S_1$.

5 Conclusion 

We have shown that probing attacks are a powerful tool to derive
information on secret keys in embedded hardware. The interesting feature
of these attacks is that they are not destructive, as many previously
suggested attacks are. In essence, probing does not require cutting
wires, inducing faults or even stressing the device to make it behave
abnormally, for we merely observe (spy on) a single bit during execution.
We have shown that public key algorithms using exponentiation or the
discrete logarithm such as RSA or DSA, as well as secret key algorithms
such as DES or RC5, would be vulnerable to such powerful attacks.

Abstract. This paper describes an algorithm for computing elliptic
scalar multiplications on non-supersingular elliptic curves defined over
GF(2^m). The algorithm is an optimized version of a method described
in [1], which is based on Montgomery's method [8]. Our algorithm is
easy to implement in both hardware and software, works for any elliptic
curve over GF(2^m), requires no precomputed multiples of a point, and
is faster on average than the addition-subtraction method described in
draft standard IEEE P1363. In addition, the method requires less
memory than projective schemes, and the amount of computation needed
for a scalar multiplication is fixed for all multipliers of the same
binary length. Therefore, the improved method possesses many desirable
features for implementing elliptic curves in restricted environments.

Key words. Elliptic curves over GF(2^m), Point multiplication.



1 Introduction 

Elliptic curve cryptography, first suggested by Koblitz [5] and Miller
[12], is becoming increasingly common for implementing public-key
protocols such as the Diffie-Hellman key agreement. The security of these
cryptosystems relies on the presumed intractability of the discrete
logarithm problem on elliptic curves. Since there is no known
sub-exponential type algorithm for this problem on elliptic curves over
finite fields, the sizes of the fields, keys, and other parameters can be
considerably shorter than in other public key cryptosystems such as RSA
with the same level of security. This can be an advantage especially for
applications where resources such as memory and/or computing power are
limited.

Elliptic curves over GF(2^m) are particularly attractive, because the
finite field operations can be implemented very efficiently in hardware
and software. See for example [1] for a hardware implementation and [19]
for a software implementation.

Given an elliptic point P and a large integer k of about the size of the
underlying field, the operation elliptic scalar multiplication, kP, is
defined to be the elliptic point resulting from adding P to itself k
times. This operation, analogous to exponentiation in multiplicative
groups, is the most time consuming operation of the elliptic curve
cryptosystems.

In this paper, the calculation of kP for a random integer k and a random
point P is considered. An efficient scalar multiplication algorithm,
which is an optimized version of an algorithm described in [1], is
presented. The proposed algorithm is suitable for hardware and software
implementation of random elliptic curves over GF(2^m).

2 Previous Work 

The basic method for computing kP is the addition-subtraction method
described in draft standard IEEE P1363 [14]. This method is an improved
version of the well known "add-and-double" (or binary) method, which
requires no precomputations. For a random multiplier k, this algorithm
performs on average $\frac{8}{3} \log_2 k$ field multiplications and
$\frac{4}{3} \log_2 k$ field inversions in affine coordinates, and
$8.3 \log_2 k$ field multiplications in projective coordinates.

Several proposed generalizations of the binary method (for exponentiation
in a multiplicative group), such as the k-ary method and the signed
window method, can be extended to compute elliptic scalar multiplications
over a finite field [11]. These algorithms are based on the use of
precomputation and methods for recoding the multiplier. In [3], several
algorithms are analyzed under various conditions. However, most of the
proposed optimizations may not be worthwhile when memory is at a premium.

Some special classes of elliptic curves defined over GF(2^m) allow
efficient implementations. For anomalous curves, the fastest known
algorithm to compute kP is given in [17]; for curves defined over small
subfields, efficient algorithms are presented in [13].

In [4,16,7] some techniques are presented for accelerating methods such
as k-ary and window based methods. These methods are suitable for
software implementation of random elliptic curves over GF(2^m).

A different approach for computing kP was introduced by Montgomery [8].
This approach is based on the binary method and the observation that the
x-coordinate of the sum of two points whose difference is known can be
computed in terms of the x-coordinates of the involved points. This
method uses the following variant of the binary method:

Input: An integer k > 0 and a point P.
Output: Q = kP.

1. Set $k \leftarrow (k_{l-1} \ldots k_1 k_0)_2$.
2. Set $P_1 \leftarrow P$, $P_2 \leftarrow 2P$.
3. for i from l - 2 downto 0 do
       if $k_i = 1$ then
           Set $P_1 \leftarrow P_1 + P_2$, $P_2 \leftarrow 2P_2$.
       else
           Set $P_2 \leftarrow P_2 + P_1$, $P_1 \leftarrow 2P_1$.
4. return (Q = $P_1$).



Fig. 1. Algorithm 1: Binary Method



Note that this method maintains the invariant relationship
$P_2 - P_1 = P$ and performs an addition and a doubling in each
iteration. In [9], Montgomery's method was applied for reducing the
number of registers needed to add points in supersingular curves over
GF(2^m). However, the authors observed that the benefit in storage
provided by Montgomery's method comes at a considerable expense of speed.
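
The invariant is easy to check on any additive group. Here is a minimal
sketch of Algorithm 1 with the group operations passed in as functions;
the integers modulo N = 1009 stand in for an elliptic curve group purely
for illustration, and the names are hypothetical.

```python
def montgomery_ladder(k, P, add, double):
    """Return (kP, (k+1)P) for k > 0 using the ladder of Algorithm 1."""
    if k <= 0:
        raise ValueError("k must be positive")
    P1, P2 = P, double(P)
    for bit in bin(k)[3:]:               # bits k_{l-2} ... k_0
        if bit == "1":
            P1, P2 = add(P1, P2), double(P2)
        else:
            P1, P2 = double(P1), add(P1, P2)
    return P1, P2


# Stand-in group: integers modulo N under addition (an assumption made
# only so that the result can be checked against ordinary multiplication).
N = 1009


def add(u, v):
    return (u + v) % N


def double(u):
    return (2 * u) % N


kP, k_plus_1_P = montgomery_ladder(77, 5, add, double)
```

Every iteration performs exactly one `add` and one `double`, whatever
the bit value, and the difference of the two outputs is always P.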

From the point of view of hardware implementation of elliptic curves
over GF(2^m), few papers have discussed efficient methods for computing
kP. In [1], Montgomery's method was adapted for non-supersingular
elliptic curves over GF(2^m). However, the formulas given for
implementing each iteration are not efficient in terms of field
multiplications.

In this paper we will present an efficient implementation of
Montgomery's method for computing kP on non-supersingular elliptic curves
over GF(2^m).

The remainder of the paper is organized as follows. In Section 3 we
present a short introduction to elliptic curves over GF(2^m). The
proposed algorithm is described and analyzed in Section 4. Some running
times of the proposed algorithm based on LiDIA are presented in
Section 5. An implementation of the proposed algorithm is given in the
appendix.

3 Elliptic Curves over GF(2^m)

Here we present a brief introduction to elliptic curves; more information
on elliptic curves over finite fields of characteristic two can be found
in [10,14]. Let GF(2^m) be a finite field of characteristic two. A
non-supersingular elliptic curve E over GF(2^m) is defined to be the set
of solutions $(x, y) \in GF(2^m) \times GF(2^m)$ to the equation

    $y^2 + xy = x^3 + ax^2 + b$,

where a and $b \in GF(2^m)$, $b \neq 0$, together with the point at
infinity denoted by O.

It is well known that E forms a commutative finite group, with O as the
group identity, under the addition operation known as the "tangent and
chord method". Explicit rational formulas for the addition rule involve
several arithmetic operations (adding, squaring, multiplication and
inversion) in the underlying finite field. Formulas for adding two points
in projective coordinates can be found in [10,7]. In affine coordinates,
the elliptic group operation is given by the following. Let
$P = (x_1, y_1) \in E$; then $-P = (x_1, x_1 + y_1)$. For all
$P \in E$, $O + P = P + O = P$. If $Q = (x_2, y_2) \in E$ and
$Q \neq -P$, then $P + Q = (x_3, y_3)$, where

$$x_3 = \begin{cases}
\left(\dfrac{y_1 + y_2}{x_1 + x_2}\right)^2 + \dfrac{y_1 + y_2}{x_1 + x_2} + x_1 + x_2 + a, & P \neq Q, \\[2ex]
x_1^2 + \dfrac{b}{x_1^2}, & P = Q,
\end{cases} \qquad (1)$$

and

$$y_3 = \begin{cases}
\left(\dfrac{y_1 + y_2}{x_1 + x_2}\right)(x_1 + x_3) + x_3 + y_1, & P \neq Q, \\[2ex]
x_1^2 + \left(x_1 + \dfrac{y_1}{x_1}\right) x_3 + x_3, & P = Q.
\end{cases} \qquad (2)$$

Notice that the x-coordinate of 2P does not involve the y-coordinate of
P. This observation will be used in the derivation of the improved
method.



4 Improved Method 

This section describes the improved method for computing kP. We first
develop an algorithm in affine coordinates which requires two field
inversions in each iteration. Next a "projective" version is presented
with more field multiplications, but with only one field inversion at the
end of the computation.

4.1 Affine Version 

The extension of Montgomery's method [8] to elliptic curves over GF(2^m)
requires formulas for implementing Step 3 of Algorithm 1. In what follows
we give efficient formulas that use only the x-coordinates of $P_1$,
$P_2$ and P for performing the arithmetic operations needed in
Algorithm 1. At the end of the final iteration of Algorithm 1, we obtain
the x-coordinates of kP and (k + 1)P. We also provide a simple formula
for recovering the y-coordinate of kP.

The following lemma gives another formula for computing the x-coordinate
of the addition of two different points.

Lemma 1 Let $P_1 = (x_1, y_1)$ and $P_2 = (x_2, y_2)$ be elliptic points.
Then the x-coordinate of $P_1 + P_2$, $x_3$, can be computed as follows.

$$x_3 = \frac{x_1 y_2 + x_2 y_1 + x_1 x_2^2 + x_2 x_1^2}{(x_1 + x_2)^2} . \qquad (3)$$

Proof. Since $P_1$ and $P_2$ are elliptic points, it follows that
$y_1^2 + y_2^2 + x_1 y_1 + x_2 y_2 + x_1^3 + x_2^3 + a(x_1^2 + x_2^2) = 0$.
The result then follows easily from formula (1).

The following lemma shows how to compute the x-coordinate for the
addition of two points whose difference is known.

Lemma 2 Let $P = (x, y)$, $P_1 = (x_1, y_1)$ and $P_2 = (x_2, y_2)$ be
elliptic points. Assume that $P_2 = P_1 + P$. Then the x-coordinate of
$P_1 + P_2$, $x_3$, can be computed in terms of the x-coordinates of P,
$P_1$ and $P_2$ as follows.

$$x_3 = \begin{cases}
x + \left(\dfrac{x_1}{x_1 + x_2}\right)^2 + \dfrac{x_1}{x_1 + x_2}, & P_1 \neq P_2, \\[2ex]
x_1^2 + \dfrac{b}{x_1^2}, & P_1 = P_2.
\end{cases} \qquad (4)$$

Proof. The case $P_1 = P_2$ follows directly from (1). Applying formula
(3), we obtain that the x-coordinate of $P_2 + P_1$ can be rewritten as

$$x_3 = \frac{x_1 y_2 + x_2 y_1 + x_1 x_2^2 + x_2 x_1^2}{(x_1 + x_2)^2} . \qquad (5)$$

Similarly, the x-coordinate of $P_2 - P_1$ satisfies

$$x = \frac{x_1 y_2 + x_2 (x_1 + y_1) + x_1 x_2^2 + x_2 x_1^2}{(x_1 + x_2)^2} . \qquad (6)$$

The result follows from adding (5) and (6).

The next lemma allows one to compute the y-coordinate of $P_1$ when P and
the x-coordinates of $P_1$ and $P_1 + P$ are known.

Lemma 3 Let $P = (x, y)$, $P_1 = (x_1, y_1)$ and $P_2 = (x_2, y_2)$ be
elliptic points. Assume that $P_2 = P_1 + P$ and $x \neq 0$. Then the
y-coordinate of $P_1$ can be expressed in terms of P and the
x-coordinates of $P_1$ and $P_2$ as follows.

$$y_1 = (x_1 + x)\{(x_1 + x)(x_2 + x) + x^2 + y\}/x + y . \qquad (7)$$

Proof. Since $P_2 = P_1 + P$, we obtain from (3) that $y_1$ satisfies the
following equation:

$$x_2 (x_1 + x)^2 = x_1 y + x y_1 + x_1 x^2 + x x_1^2 .$$

Therefore,

$$\begin{aligned}
x y_1 &= x_2 x_1^2 + x_2 x^2 + x_1 y + x_1 x^2 + x x_1^2 \\
      &= x_1\{x_1 x_2 + x_1 x + x^2 + y\} + x\{x x_2\} \\
      &= x_1\{x_1 x_2 + x_1 x + x^2 + x x_2 + x^2 + y\} + x\{x_1 x_2 + x_1 x + x x_2 + y\} + xy \\
      &= (x_1 + x)\{(x_1 + x)(x_2 + x) + x^2 + y\} + xy .
\end{aligned}$$

The following algorithm, based on Lemmas 2 and 3, implements
Montgomery's method in affine coordinates.

Input: An integer $k \geq 0$ and a point $P = (x, y) \in E$.
Output: Q = kP.

1. if k = 0 or x = 0 then output (0, 0) and stop.
2. Set $k \leftarrow (k_{l-1} \ldots k_1 k_0)_2$.
3. Set $x_1 \leftarrow x$, $x_2 \leftarrow x^2 + b/x^2$.
4. for i from l - 2 downto 0 do
       Set $t \leftarrow \dfrac{x_1}{x_1 + x_2}$.
       if $k_i = 1$ then
           Set $x_1 \leftarrow x + t^2 + t$, $x_2 \leftarrow x_2^2 + b/x_2^2$.
       else
           Set $x_1 \leftarrow x_1^2 + b/x_1^2$, $x_2 \leftarrow x + t^2 + t$.
5. Set $r_1 \leftarrow x_1 + x$, $r_2 \leftarrow x_2 + x$.
6. Set $y_1 \leftarrow r_1 (r_1 r_2 + x^2 + y)/x + y$.
7. return (Q = $(x_1, y_1)$).



Fig. 2. Algorithm 2A: Montgomery Scalar Multiplication



Observe that Algorithm 2A, in each iteration of Step 4, performs two
field inversions, one general field multiplication, one multiplication by
the constant b, two squarings, and four additions; it follows that the
total number of field operations to compute kP is given in the following
lemma:

Lemma 4 For computing kP, Algorithm 2A takes exactly the following
number of field operations in GF(2^m):

    #INV.  = $2\lfloor \log_2 k \rfloor + 1$,    #MULT. = $2\lfloor \log_2 k \rfloor + 4$,
    #ADD.  = $4\lfloor \log_2 k \rfloor + 6$,    #SQR.  = $2\lfloor \log_2 k \rfloor + 2$.

Remark. A further improvement to Algorithm 2A is to use an optimized
routine to multiply by the constant b. Another potential improvement is
to compute $x_1$ and $x_2$ from Step 4 in parallel, since these
calculations are independent of each other.
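
Algorithm 2A can be exercised on a small field. The following is a
minimal sketch, not the authors' implementation: it assumes a toy field
GF(2^8) with reduction polynomial x^8 + x^4 + x^3 + x + 1 and
hypothetical curve coefficients a = 1 and b = 0x2A, and it includes the
textbook group law of formulas (1) and (2) only as a reference to compare
against. All function names are illustrative. The x-only formulas assume
that no intermediate point is O or has x = 0, which holds whenever 2k + 2
is smaller than the order of P.

```python
M, POLY = 8, 0x11B           # toy field GF(2^8) (assumed for illustration)
A, B = 1, 0x2A               # hypothetical curve coefficients, B != 0


def gmul(a, b):
    """Multiply two field elements (shift-and-add, reduced modulo POLY)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a >> M:
            a ^= POLY
        b >>= 1
    return r


def gsq(a):
    return gmul(a, a)


def ginv(a):
    """Inverse of a != 0 via Fermat's theorem: a^(2^M - 2)."""
    r = 1
    for _ in range(M - 1):
        a = gsq(a)
        r = gmul(r, a)
    return r


def gdiv(a, b):
    return gmul(a, ginv(b))


def on_curve(x, y):
    return (gsq(y) ^ gmul(x, y)) == (gmul(gsq(x), x) ^ gmul(A, gsq(x)) ^ B)


def affine_add(P, Q):
    """Reference group law from formulas (1) and (2); None stands for O."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and y2 == (x1 ^ y1):     # Q = -P
        return None
    if x1 == x2:                         # P = Q (doubling)
        x3 = gsq(x1) ^ gdiv(B, gsq(x1))
        y3 = gsq(x1) ^ gmul(x1 ^ gdiv(y1, x1), x3) ^ x3
    else:
        lam = gdiv(y1 ^ y2, x1 ^ x2)
        x3 = gsq(lam) ^ lam ^ x1 ^ x2 ^ A
        y3 = gmul(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)


def montgomery_affine(k, P):
    """Algorithm 2A: x-only ladder, then y recovery with formula (7)."""
    x, y = P
    if k == 0 or x == 0:
        return (0, 0)
    x1, x2 = x, gsq(x) ^ gdiv(B, gsq(x))
    for bit in bin(k)[3:]:
        t = gdiv(x1, x1 ^ x2)
        s = x ^ gsq(t) ^ t               # x-coordinate of P1 + P2
        if bit == "1":
            x1, x2 = s, gsq(x2) ^ gdiv(B, gsq(x2))
        else:
            x1, x2 = gsq(x1) ^ gdiv(B, gsq(x1)), s
    r1, r2 = x1 ^ x, x2 ^ x
    y1 = gdiv(gmul(r1, gmul(r1, r2) ^ gsq(x) ^ y), x) ^ y
    return (x1, y1)


# First curve point with x != 0, found by exhaustive search.
P = next((x, y) for x in range(1, 1 << M) for y in range(1 << M)
         if on_curve(x, y))
```

For example, `montgomery_affine(3, P)` should agree with
`affine_add(affine_add(P, P), P)` whenever the order of P exceeds 8.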



4.2 Projective Version 



When field inversion in GF(2^m) is relatively expensive (e.g., inversion
based on Fermat's theorem requires at least 7 multiplications in GF(2^m)
if m > 128), then it may be of computational advantage to use fractional
field arithmetic to perform elliptic curve calculations.

Let P, $P_1$ and $P_2$ be points on the curve E such that
$P_2 = P_1 + P$. Let the x-coordinate of $P_i$ be represented by
$X_i/Z_i$, for $i \in \{1, 2\}$. From Lemma 2, when the x-coordinate of
$2P_i$ is converted to projective coordinates it becomes

    $x(2P_i) = X_i^4 + b \cdot Z_i^4$,
    $z(2P_i) = Z_i^2 \cdot X_i^2$.

Similarly, the x-coordinate of $P_1 + P_2$ in projective coordinates can
be computed as the fraction $X_3/Z_3$, where

    $Z_3 = (X_1 \cdot Z_2 + X_2 \cdot Z_1)^2$,
    $X_3 = x \cdot Z_3 + (X_1 \cdot Z_2) \cdot (X_2 \cdot Z_1)$.

The addition formula requires three general field multiplications, one
multiplication by x (i.e., the x-coordinate of P, which is fixed during
the computation of kP), one squaring and two additions; doubling requires
one general field multiplication, one multiplication by the constant b,
four squarings, and one addition. A method based on these formulas is
described in the next algorithm.



Input: An integer $k \geq 0$ and a point $P = (x, y) \in E$.
Output: Q = kP.

1. if k = 0 or x = 0 then output (0, 0) and stop.
2. Set $k \leftarrow (k_{l-1} \ldots k_1 k_0)_2$.
3. Set $X_1 \leftarrow x$, $Z_1 \leftarrow 1$, $X_2 \leftarrow x^4 + b$, $Z_2 \leftarrow x^2$.
4. for i from l - 2 downto 0 do
       if $k_i = 1$ then
           Madd($X_1, Z_1, X_2, Z_2$), Mdouble($X_2, Z_2$).
       else
           Madd($X_2, Z_2, X_1, Z_1$), Mdouble($X_1, Z_1$).
5. return (Q = Mxy($X_1, Z_1, X_2, Z_2$)).



Fig. 3. Algorithm 2P: Montgomery Scalar Multiplication



An implementation of the procedures Madd, Mdouble and Mxy is given in the
appendix.

Lemma 5 Algorithm 2P performs exactly the following number of field
operations in GF(2^m):

    #INV.  = 1,                                 #MULT. = $6\lfloor \log_2 k \rfloor + 10$,
    #ADD.  = $3\lfloor \log_2 k \rfloor + 7$,   #SQR.  = $5\lfloor \log_2 k \rfloor + 3$.

Remark. Since the complexity of both versions of Algorithm 2 does not
depend on the number of 1's (or 0's) in the binary representation of k,
this may help to prevent timing attacks. On the other hand, the use of
restricted multipliers (e.g., with small Hamming weight) does not
directly speed up Algorithms 2A and 2P, and this is a disadvantage
compared to methods such as the binary method. However, from a practical
point of view, most protocols in cryptographic applications use random
multipliers.
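
The projective ladder can be sketched in the same toy setting. This is an
illustration under stated assumptions, not the appendix code: the field
GF(2^8), the coefficients a = 1 and b = 0x2A and all function names are
hypothetical, the routines return new values instead of updating
registers in place, and the doubling uses b directly rather than its
square root. The affine group law is included only as a reference.

```python
M, POLY = 8, 0x11B           # toy field GF(2^8) (assumed for illustration)
A, B = 1, 0x2A               # hypothetical curve coefficients, B != 0


def gmul(a, b):
    """Multiply two field elements (shift-and-add, reduced modulo POLY)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a >> M:
            a ^= POLY
        b >>= 1
    return r


def gsq(a):
    return gmul(a, a)


def ginv(a):
    """Inverse of a != 0 via Fermat's theorem: a^(2^M - 2)."""
    r = 1
    for _ in range(M - 1):
        a = gsq(a)
        r = gmul(r, a)
    return r


def on_curve(x, y):
    return (gsq(y) ^ gmul(x, y)) == (gmul(gsq(x), x) ^ gmul(A, gsq(x)) ^ B)


def affine_add(P, Q):
    """Reference affine group law; None stands for the point O."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and y2 == (x1 ^ y1):
        return None
    if x1 == x2:
        x3 = gsq(x1) ^ gmul(B, ginv(gsq(x1)))
        y3 = gsq(x1) ^ gmul(x1 ^ gmul(y1, ginv(x1)), x3) ^ x3
    else:
        lam = gmul(y1 ^ y2, ginv(x1 ^ x2))
        x3 = gsq(lam) ^ lam ^ x1 ^ x2 ^ A
        y3 = gmul(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)


def mdouble(X, Z):
    """x(2P) = (X^4 + b Z^4) / (X^2 Z^2)."""
    X2, Z2 = gsq(X), gsq(Z)
    return gsq(X2) ^ gmul(B, gsq(Z2)), gmul(X2, Z2)


def madd(X1, Z1, X2, Z2, x):
    """x(P1 + P2) = (x Z3 + (X1 Z2)(X2 Z1)) / Z3, Z3 = (X1 Z2 + X2 Z1)^2."""
    u, v = gmul(X1, Z2), gmul(X2, Z1)
    Z3 = gsq(u ^ v)
    return gmul(x, Z3) ^ gmul(u, v), Z3


def mxy(X1, Z1, X2, Z2, x, y):
    """Affine coordinates of P1 via Lemma 3, using a single inversion."""
    if Z1 == 0:
        return (0, 0)
    if Z2 == 0:
        return (x, x ^ y)
    zz = gmul(Z1, Z2)
    inv = ginv(gmul(x, zz))                       # 1 / (x Z1 Z2)
    num = gmul(X1 ^ gmul(x, Z1), X2 ^ gmul(x, Z2)) ^ gmul(gsq(x) ^ y, zz)
    xk = gmul(gmul(X1, gmul(x, Z2)), inv)         # X1 / Z1
    return (xk, gmul(gmul(xk ^ x, num), inv) ^ y)


def montgomery_projective(k, P):
    """Algorithm 2P: projective x-only ladder followed by Mxy."""
    x, y = P
    if k == 0 or x == 0:
        return (0, 0)
    X1, Z1, X2, Z2 = x, 1, gsq(gsq(x)) ^ B, gsq(x)
    for bit in bin(k)[3:]:
        if bit == "1":
            X1, Z1 = madd(X1, Z1, X2, Z2, x)
            X2, Z2 = mdouble(X2, Z2)
        else:
            X2, Z2 = madd(X2, Z2, X1, Z1, x)
            X1, Z1 = mdouble(X1, Z1)
    return mxy(X1, Z1, X2, Z2, x, y)


P = next((x, y) for x in range(1, 1 << M) for y in range(1 << M)
         if on_curve(x, y))
```

The loop contains no field inversion; the only one is inside `mxy`, in
line with the single inversion counted in Lemma 5.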

4.3 Complexity Comparison

In the sequel, we assume that adding and squaring in GF(2^m) are
relatively fast. Now we compare the complexity of the
addition-subtraction method to the complexity of the proposed method.
This is a fair comparison since neither method uses precomputation. For a
random multiplier k, the addition-subtraction method in projective
coordinates, given in [14], performs $8.3 \log_2 k$ field
multiplications; it follows that we expect Algorithm 2P to be about 28%
faster on average. However, if we use the formulas given in [7] for
implementing the group operation in projective schemes, Algorithm 2P is
about 14% faster than the addition-subtraction method. In the following
table we summarize the complexities of these methods.

Table 1. Complexity Comparison of Algorithm 2P with other algorithms (a = 0, 1).

    Method          Projective Coordinates
    Binary [10]     $13 \log_2 k$
    Add-Sub [14]    $8.3 \log_2 k$
    Add-Sub [7]     $7 \log_2 k$
    Algorithm 2P    $6 \log_2 k$

Now we derive the cost of the addition-subtraction method (using affine
coordinates) in terms of field multiplications. As mentioned in
Section 2, this method performs on average $\frac{8}{3} \log_2 k$ field
multiplications and $\frac{4}{3} \log_2 k$ field inversions. Thus, the
total cost is $\frac{1}{3}(4r + 8) \log_2 k$ multiplications, where r is
the cost-ratio of inversion to multiplication. This shows that for
implementations of the finite field GF(2^m) where r > 2.5 (see for
example [1,19,4]), Algorithm 2P gives a computational advantage over the
addition-subtraction method.
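
The threshold and the quoted percentages can be checked directly. The
following worked step assumes the affine addition-subtraction cost is
read as $\frac{1}{3}(4r + 8)\log_2 k$ multiplications, the reading
consistent with the stated threshold, and takes only the leading term
$6 \log_2 k$ of Algorithm 2P:

```latex
\[
\tfrac{1}{3}(4r + 8)\,\log_2 k \;>\; 6\,\log_2 k
\iff 4r + 8 > 18
\iff r > 2.5 ,
\]
\[
1 - \frac{6}{8.3} \approx 0.28 , \qquad 1 - \frac{6}{7} \approx 0.14 .
\]
```

The second line reproduces the 28% and 14% figures quoted for the
comparison in projective coordinates.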

5 Running Times 

In this section we present some running times we obtained in our software
implementation of the proposed algorithm over the finite fields GF(2^m),
where m = 163, 191 and 239. To represent the finite fields we used
LiDIA [6], a C++ based library. This finite field implementation uses a
polynomial basis representation, and the irreducible modulus is chosen as
sparse as possible. We used a Sun UltraSPARC 300MHz machine. For
comparison, we list in Table 2 the timings for the basic arithmetic
operations in GF(2^m).

Notice that one field inverse costs more than 9 field multiplications;
therefore, the use of LiDIA may illustrate the performance of the
proposed algorithm in situations where a field inverse is relatively
expensive compared to field multiplication.

In Table 3 we present average running times for computing a scalar
multiplication using several methods. These values were obtained using
the following test: we select 10 random elliptic curves (a = 0) over
GF(2^m), then we multiply a random point P in each curve with 100
randomly chosen integers of size < $2^m$. We implemented the binary
method in projective coordinates (see [10]), the addition-subtraction
method [14] and Algorithm 2P. From Table 3 we conclude that the proposed
method on average is 27-29% faster than the addition-subtraction method
and 51% faster than the binary method. These timings show that the
theoretical improvement of Algorithm 2P, given in Table 1, is observed in
an actual implementation.

Table 2. Average running times (in microseconds) for GF(2^m) using LiDIA.

    Extension m    Add.    Sqr.    Mult.    Inv.
    163            0.6     2.3     10.5     96.2
    191            0.7     2.0     10.9     118.1
    239            0.8     2.6     14.6     162.8

Table 3. Average running times (in milliseconds) for computing mP.

    Extension m    Binary [10]    Add-Sub. [14]    Algorithm 2P
    163            27.5           19.1             13.5
    191            33.1           22.4             16.0
    239            52.3           35.1             25.6
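
The quoted speedups follow from the Table 3 timings by simple arithmetic.
A small sketch (the dictionary layout and function name are illustrative)
that recomputes them:

```python
# Table 3 timings in milliseconds: m -> (binary, add-sub, Algorithm 2P).
table3 = {
    163: (27.5, 19.1, 13.5),
    191: (33.1, 22.4, 16.0),
    239: (52.3, 35.1, 25.6),
}


def percent_faster(old, new):
    """Relative time saved by `new` with respect to `old`, in percent."""
    return 100.0 * (1.0 - new / old)


vs_add_sub = {m: percent_faster(t[1], t[2]) for m, t in table3.items()}
vs_binary = {m: percent_faster(t[0], t[2]) for m, t in table3.items()}
```

The values in `vs_add_sub` fall between 27% and 30%, and those in
`vs_binary` are close to 51%, matching the figures stated in the text.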

6 Conclusion 

In this paper, we have presented an efficient method for computing
elliptic scalar multiplications, which is an optimized version of an
algorithm presented in [1]. The method performs exactly
$6\lfloor \log_2 k \rfloor + 10$ field multiplications for computing kP
on elliptic curves selected at random, is easy to implement in both
hardware and software, requires no precomputations, works for any
implementation of GF(2^m), is faster than the addition-subtraction method
on average, and uses fewer registers than methods based on projective
schemes. Therefore, the method appears useful for applications of
elliptic curves in constrained environments such as mobile devices and
smart cards.

8 Appendix

Mdouble (Doubling algorithm)

Input: the finite field GF(2^m); the field elements a and
$c = b^{2^{m-1}}$ ($c^2 = b$) defining a curve E over GF(2^m); the
x-coordinate X/Z for a point P.

Output: the x-coordinate X/Z for the point 2P.

1. $T_1 \leftarrow c$
2. $X \leftarrow X^2$
3. $Z \leftarrow Z^2$
4. $T_1 \leftarrow Z \times T_1$
5. $Z \leftarrow Z \times X$
6. $T_1 \leftarrow T_1^2$
7. $X \leftarrow X^2$
8. $X \leftarrow X + T_1$

This algorithm requires one general field multiplication, one field
multiplication by the constant c, four field squarings and one temporary
variable.

Madd (Adding algorithm)

Input: the finite field GF(2^m); the field elements a and b defining a
curve E over GF(2^m); the x-coordinate of the point P; the x-coordinates
$X_1/Z_1$ and $X_2/Z_2$ for the points $P_1$ and $P_2$ on E.

Output: The x-coordinate $X_1/Z_1$ for the point $P_1 + P_2$.

1. $T_1 \leftarrow x$
2. $X_1 \leftarrow X_1 \times Z_2$
3. $Z_1 \leftarrow Z_1 \times X_2$
4. $T_2 \leftarrow X_1 \times Z_1$
5. $Z_1 \leftarrow Z_1 + X_1$
6. $Z_1 \leftarrow Z_1^2$
7. $X_1 \leftarrow Z_1 \times T_1$
8. $X_1 \leftarrow X_1 + T_2$

This algorithm requires three general field multiplications, one field
multiplication by x, one field squaring and two temporary variables.

Mxy (Affine coordinates)

Input: the finite field GF(2^m); the affine coordinates of the point
P = (x, y); the x-coordinates $X_1/Z_1$ and $X_2/Z_2$ for the points
$P_1$ and $P_2$.

Output: The affine coordinates $(x_1, y_1) = (X_2, Z_2)$ for the point
$P_1$.

1. if $Z_1 = 0$ then output (0, 0) and stop.
2. if $Z_2 = 0$ then output (x, x + y) and stop.
3. $T_1 \leftarrow x$
4. $T_2 \leftarrow y$
5. $T_3 \leftarrow Z_1 \times Z_2$
6. $Z_1 \leftarrow Z_1 \times T_1$
7. $Z_1 \leftarrow Z_1 + X_1$
8. $Z_2 \leftarrow Z_2 \times T_1$
9. $X_1 \leftarrow Z_2 \times X_1$
10. $Z_2 \leftarrow Z_2 + X_2$
11. $Z_2 \leftarrow Z_2 \times Z_1$
12. $T_4 \leftarrow T_1^2$
13. $T_4 \leftarrow T_4 + T_2$
14. $T_4 \leftarrow T_4 \times T_3$
15. $T_4 \leftarrow T_4 + Z_2$
16. $T_3 \leftarrow T_3 \times T_1$
17. $T_3 \leftarrow \mathrm{inverse}(T_3)$
18. $T_4 \leftarrow T_3 \times T_4$
19. $X_2 \leftarrow X_1 \times T_3$
20. $Z_2 \leftarrow X_2 + T_1$
21. $Z_2 \leftarrow Z_2 \times T_4$
22. $Z_2 \leftarrow Z_2 + T_2$

This algorithm requires one field inversion, ten general field
multiplications, one field squaring and four temporary variables.

Abstract. Recently, a novel public-key cryptosystem constructed on
number fields was presented. The prominent theoretical property of this
public-key cryptosystem is that its decryption has bit complexity
quadratic in the length of the public key and consists of only simple
fast arithmetical operations. We call the cryptosystem NICE (New Ideal
Coset Encryption). In this paper, we consider practical aspects of the
NICE cryptosystem. Our implementation in software shows that the
decryption time of NICE is comparable to the encryption time of the RSA
cryptosystem with $e = 2^{16} + 1$. To show whether existing smart cards
can be used, we implemented the NICE cryptosystem using a smart card
designed for the RSA cryptosystem. Our result shows that the decryption
time of NICE is comparable to the decryption time of the RSA
cryptosystem, but the gain is not as large as in the software
implementation. We discuss the reasons for this and indicate
requirements for smart card designers to achieve fast implementation on
smart cards.

Key words: public-key cryptosystem, fast decryption, quadratic order,
smart card implementation.



1 Why NICE? 

Many public-key cryptosystems other than RSA have been proposed. They
range from deep number theory (hyper- and superelliptic curves) to
geometry and combinatorics (LLL-based systems). One major advantage of
RSA is its simplicity: it can be easily implemented, and one needs only a
moderate background in mathematics to understand it. Moreover, RSA is
quite fast. There exist public-key cryptosystems which are much faster,
but the combination of simplicity, speed and confidence in its security
makes RSA the most practical cryptosystem. Until now, only one other
family of systems may be considered equally interesting: ElGamal-type
systems on elliptic curves. Elliptic curves are somewhat more
complicated, and the best known algorithms to break elliptic curve
cryptosystems are much slower, of exponential complexity. So, both speed
and security are worth the price of the higher mathematical complexity.

But both systems have one drawback in common: the decryption/signing
time is of cubic complexity in the bit length of the public key, because
these steps consist of modular multiplication(s). This becomes even more
important when thinking of smart cards. The smart card is considered to
be the personal security computing device of tomorrow. It contains all
personal secret information, especially the private keys used for
public-key operations such as decryption and signing. The complexity of
PC operating systems is too high for them to be reliably secure; every
relevant security operation should be carried out by the smart card.
Non-relevant operations like public-key encryption and signature
verification could be done by the PC. Operations which cannot be
transferred to the PC at all for security reasons are decryption and
signing. Considering RSA, the task which has to be carried out by a low
power computing device is precisely the most complex task. Moreover, we
can expect that the length of the public key will increase with the
progress of hardware technology. In addition, there are no guarantees
that new sub-exponential attacks on the basic number theoretic problem
will not suddenly be proposed. Therefore it would be better to have a
more efficient public-key cryptosystem.

As an alternative, we might use a new public-key cryptosystem constructed
over number fields [18]. The cryptosystem has a theoretically fast
decryption process: its decryption complexity is quadratic in the bit
length of the public key, and it consists only of simple, fast
arithmetical operations. So even if the key length gets bigger in the
future, there will be no great increase of the computational complexity.
This becomes even more important when thinking of smart cards. In this
paper, we call the new cryptosystem NICE (New Ideal Coset Encryption). We
focus on the practical aspects of the NICE cryptosystem. We implement the
NICE cryptosystem on different architectures, namely in software on a
standard PC and on a smart card designed for the RSA cryptosystem. Our
implementation in software shows that NICE decryption is as fast as the
encryption of the RSA cryptosystem with $e = 2^{16} + 1$. The
implementation on a smart card designed for the RSA cryptosystem is
comparable in speed to the decryption of the RSA cryptosystem, but not as
fast as the software implementation. We discuss the reasons for this and
indicate requirements to achieve a fast implementation on smart cards.

This paper is organized as follows: In section 2, we give several
applications based on the NICE cryptosystem. In section 3, we explain the
details of the algorithms of the NICE cryptosystem. In section 4, we show
timings of the implementation in software. In section 5, we discuss a
smart card implementation and its problems.

2 Applications 



In section 3, we will present NICE in the formulation of an encryption
scheme. An immediate application is therefore session key distribution
from a powerful server to a device which has limited computing power or
where time is important. An example of such a device is a mobile phone.

Another application of NICE is its use as an authentication scheme. The
usual protocols (2-way and 3-way) can be adapted to use NICE as the
encryption component. Of course, it could be combined with RSA, where the
modulus is the absolute value of the discriminant of the non-maximal
order, i.e. $|\Delta_q|$. In that case, the 3-way protocol can be
realized in such a way that the client (= the low computing power device)
only carries out the fast components of both algorithms.

As a last application, we propose an undeniable signature scheme. NICE
itself cannot be used as a classical signature scheme; undeniable
signature schemes have their own use, e.g. in online transactions. A
signature of this kind cannot be verified without the interaction of the
signer. The standard example for its application is its use by a software
development company: the distributed software is signed by means of an
undeniable signature of the company to allow legal users to assure
themselves that they use unmodified software. Since interaction with the
seller is needed to check the signature, illegal users either cannot
check it and risk using virus-infected software, or will be traced by the
software seller as soon as they ask for interactive verification. Details
can be found in [4]. Again, the low computing power device only carries
out the NICE decryption steps, so smart cards can be used for assuring
the security of online transactions.



3 The NICE Cryptosystem 

In this section, we present an overview of the NICE cryptosystem. Details
can be found in [18]. The idea of NICE is roughly as follows: consider
two finite abelian groups $G$ and $H$ which are related by a surjective
map $\pi : G \to H$. Moreover, there exists a well-defined bijective
mapping of sets $\varphi : H \to U$ of $H$ onto a subset $U$ of $G$ such
that $\pi(\varphi(M)) = M$ for all $M \in H$. The representation of
elements of $G$ and the group operation algorithm of $G$ are publicly
known, as well as an element $h$ of the kernel of $\pi$. $U$ is chosen
such that a consecutive set of representations of elements of $G$ are
representations of elements of $U$. This information is publicly known.
Assume that you know the group $H$ (i.e. the representation of group
elements and the group operation) and how to compute $\pi$, but no one
else does. The message space consists of the publicly known elements of
$U$. Now, a message $m$ is probabilistically encrypted by multiplying a
random element of $\mathrm{Ker}\,\pi$ onto it: the ciphertext is
$c = m \cdot h^r$ for a random exponent $r$. Decryption simply works as
follows: compute $\varphi(\pi(c)) = \varphi(\pi(m)) = m$.

This is a secure cryptosystem if the computation of the map $\pi$ cannot
be deduced from the given information, namely the group $G$, the kernel
element $h$ and the test for $U$. There exist some constructions of this
scheme using number theoretic problems, e.g. [17]. An overview can be
found in [19].

The following implementation of this scheme is especially interesting:
Generate two random primes $p, q > 4$ such that $p \equiv 3 \pmod 4$ and
let $\Delta_1 = -p$. Let $H = Cl(\Delta_1)$ be the ideal class group of
the maximal order with discriminant $\Delta_1$ and $G = Cl(\Delta_q)$ be
the ideal class group of the non-maximal order with conductor $q$ and
discriminant $\Delta_q = \Delta_1 q^2$. $\Delta_q$ will be public, whilst
its factorization into $\Delta_1$ and $q$ will be kept private.

Any element of $G$ is given by a pair of numbers $(a, b)$ such that
$0 < a \le \sqrt{|\Delta_q|/3}$, $-a < b \le a$ and
$b^2 \equiv \Delta_q \pmod{4a}$ (and some other minor requirements).
Elements of $H$ are represented in the same way with $\Delta_1$ instead
of $\Delta_q$. The group operation works as follows:

Composition in $Cl(\Delta_q)$

Input: $(a_1, b_1), (a_2, b_2) \in Cl(\Delta_q)$, the discriminant $\Delta_q$
Output: $(a, b) = (a_1, b_1) \ast (a_2, b_2)$.

1. /* Multiplication step */
1.1. Solve $d = u a_1 + v a_2 + w (b_1 + b_2)/2$ for $d, u, v, w \in \mathbb{Z}$ using the extended Euclidean algorithm
1.2. $a \leftarrow a_1 a_2 / d^2$
1.3. $b \leftarrow b_2 + \left(v a_2 (b_1 - b_2) + w (\Delta_q - b_2^2)/2\right)/d \bmod 2a$
2. /* Reduction step */
2.1. $c \leftarrow (b^2 - \Delta_q)/4a$
2.2. WHILE NOT ($\{-a < b \le a < c\}$ or $\{0 \le b \le a = c\}$) DO
2.2.1. Find $\lambda, \mu \in \mathbb{Z}$ s.t. $-a \le \mu = b - 2\lambda a < a$ using division with remainder
2.2.2. $(a, b, c) \leftarrow (c - \lambda (b + \mu)/2, -\mu, a)$
2.3. IF $a = c$ AND $b < 0$ THEN $b \leftarrow -b$
2.4. RETURN $(a, b)$
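The multiplication and reduction steps can be tried out with a short Python sketch. This is only an illustration under stated assumptions, not the LiDIA-based code used for the timings in this paper: the names `ext_gcd`, `reduce_form`, `compose` and `power` are hypothetical, the third coefficient is taken as $c = (b^2 - \Delta)/4a$, the reduction uses the textbook normalize-then-swap loop instead of the combined step 2.2, and the toy discriminant $\Delta = -23$ is chosen only so that results can be checked by hand.

```python
def ext_gcd(x, y):
    """Return (g, s, t) with g = gcd(x, y) >= 0 and g = s*x + t*y."""
    s0, s1, t0, t1 = 1, 0, 0, 1
    while y:
        k = x // y
        x, y = y, x - k * y
        s0, s1 = s1, s0 - k * s1
        t0, t1 = t1, t0 - k * t1
    if x < 0:
        x, s0, t0 = -x, -s0, -t0
    return x, s0, t0


def reduce_form(a, b, delta):
    """Reduce the form (a, b, c) of negative discriminant delta; return (a, b)."""
    while True:
        b %= 2 * a                      # normalize b into (-a, a]
        if b > a:
            b -= 2 * a
        c = (b * b - delta) // (4 * a)  # third coefficient, always positive
        if a > c:                       # not reduced yet: swap and negate b
            a, b = c, -b
            continue
        if a == c and b < 0:
            b = -b
        return a, b


def compose(f1, f2, delta):
    """Multiplication step 1.1-1.3 followed by reduction."""
    (a1, b1), (a2, b2) = f1, f2
    s = (b1 + b2) // 2
    g, _, y = ext_gcd(a1, a2)           # g = x*a1 + y*a2
    d, z, w = ext_gcd(g, s)             # d = z*g + w*s, hence v = y*z
    v = y * z
    a = a1 * a2 // (d * d)
    b = b2 + (v * a2 * (b1 - b2) + w * (delta - b2 * b2) // 2) // d
    return reduce_form(a, b % (2 * a), delta)


def power(f, r, delta):
    """Binary exponentiation f^r in the class group."""
    result = (1, delta % 2)             # principal (identity) form
    while r:
        if r & 1:
            result = compose(result, f, delta)
        f = compose(f, f, delta)
        r >>= 1
    return result


print(compose((2, 1), (2, 1), -23))
print(power((2, 1), 3, -23))
```

The class group of discriminant $-23$ has order 3, so the cube of any form is the principal form $(1, 1)$; this gives a quick sanity check of the sketch.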

This algorithm has quadratic bit complexity $O((\log_2 \Delta_q)^2)$ and
is only needed for the encryption step. For the decryption step, we need
only to compute $\pi$. The computation of the map $\pi$ works as follows:

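Computation of $\pi$

Input: $(a, b) \in Cl(\Delta_q)$, the fundamental discriminant $\Delta_1$, the discriminant $\Delta_q$ and the conductor $q$
Output: $(A, B) = \pi((a, b))$.

1. $b_O \leftarrow \Delta_q \bmod 2$
2. Solve $1 = uq + va$ for $u, v \in \mathbb{Z}$ using the extended Euclidean algorithm
3. $A \leftarrow a$, $B \leftarrow bu + a b_O v \bmod 2a$
4. $C \leftarrow (B^2 - \Delta_1)/4A$
5. WHILE NOT ($\{-A < B \le A < C\}$ or $\{0 \le B \le A = C\}$) DO
5.1 Find $\lambda, \mu \in \mathbb{Z}$ s.t. $-A \le \mu = B - 2\lambda A < A$ using division with remainder
5.2 $(A, B, C) \leftarrow (C - \lambda (B + \mu)/2, -\mu, A)$
6. IF $A = C$ AND $B < 0$ THEN $B \leftarrow -B$
7. RETURN $(A, B)$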

The algorithm for $\pi$ has quadratic bit complexity
$O((\log_2 \Delta_q)^2)$ ([18]). Moreover, only simple well-known
operations are needed, thus this algorithm can easily be implemented on
an existing smart card.
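As an illustration of the map $\pi$, the following Python sketch follows the steps listed for its computation on a toy example. It rests on assumptions: the names `ext_gcd`, `reduce_form` and `pi_map` are hypothetical, the parameters $\Delta_1 = -23$ and $q = 5$ (so $\Delta_q = -575$) are far too small for any security and serve only to make the result checkable by hand, and the reduction is written as the textbook normalize-then-swap loop.

```python
def ext_gcd(x, y):
    """Return (g, s, t) with g = gcd(x, y) >= 0 and g = s*x + t*y."""
    s0, s1, t0, t1 = 1, 0, 0, 1
    while y:
        k = x // y
        x, y = y, x - k * y
        s0, s1 = s1, s0 - k * s1
        t0, t1 = t1, t0 - k * t1
    if x < 0:
        x, s0, t0 = -x, -s0, -t0
    return x, s0, t0


def reduce_form(a, b, delta):
    """Reduce the form (a, b, c) of negative discriminant delta; return (a, b)."""
    while True:
        b %= 2 * a                      # normalize b into (-a, a]
        if b > a:
            b -= 2 * a
        c = (b * b - delta) // (4 * a)
        if a > c:                       # not reduced yet: swap and negate b
            a, b = c, -b
            continue
        if a == c and b < 0:
            b = -b
        return a, b


def pi_map(form, delta_1, q):
    """Map a reduced form of discriminant delta_1*q^2 into Cl(delta_1)."""
    a, b = form
    delta_q = delta_1 * q * q
    b_o = delta_q % 2                   # step 1
    g, u, v = ext_gcd(q, a)             # step 2: 1 = u*q + v*a
    if g != 1:
        raise ValueError("a must be coprime to the conductor q")
    big_b = (b * u + a * b_o * v) % (2 * a)   # step 3, with A = a
    return reduce_form(a, big_b, delta_1)     # steps 4 to 7


# (8, 1) is the cube of (2, 1) in Cl(-575) and lies in the kernel of pi.
for form in [(2, 1), (4, 1), (8, 1)]:
    print(form, "->", pi_map(form, -23, 5))
```

In this toy group, $(4, 1)$ is the square of $(2, 1)$ in $Cl(-575)$, and its image is the square of the image of $(2, 1)$ in $Cl(-23)$, as expected of a group homomorphism.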
Finally, the test for membership in $U$ is simply whether, for an element
$(a, b)$, $a$ is smaller than $\lfloor \sqrt{|\Delta_1|/4} \rfloor$ (or a
lower bound thereof of the form $2^k$). The computation of the map
$\varphi$ will not be needed in that implementation.

We explain how the NICE encryption scheme works: In the key generation,
we choose an element $(h, b_h)$ from the kernel $\mathrm{Ker}\,\pi$ and
make $(h, b_h)$ public. The message $m$ is embedded in an element
$(m, b_m)$ of $Cl(\Delta_q)$ with $m$ smaller than
$\lfloor \sqrt{|\Delta_1|/4} \rfloor$. Encryption is done in the class
group $Cl(\Delta_q)$ by computing $(c, b_c) = (m, b_m)(h, b_h)^r$, where
$r$ is a random integer smaller than $2^l$ with
$l = \lfloor \log_2 (q - (\Delta_1/q)) \rfloor$ and $(\Delta_1/q)$ denotes
the Kronecker symbol. Then, having the secret information, namely the
knowledge of the conductor $q$, one can go to the maximal order and the
image of the message $(m, b_m)$ in the maximal order is revealed, since
$\pi(c, b_c) = \pi(m, b_m)$ and $m$ can be recovered without computing
$\varphi$.

The NICE encryption protocol

1. Key generation: Generate two random primes $p, q > 4$ with
$p \equiv 3 \pmod 4$ and $\sqrt{p/4} < q$. Let $\Delta_1 = -p$ and
$\Delta_q = \Delta_1 q^2$. Let $k$ and $l$ be the bit lengths of
$\lfloor \sqrt{|\Delta_1|/4} \rfloor$ and $q - (\Delta_1/q)$,
respectively. Choose an element $(h, b_h)$ in $Cl(\Delta_q)$, where

$\pi((h, b_h)) = (1, 1)$    (1)

Then $((h, b_h), \Delta_q, k, l)$ are the system parameters, and $q$ is
the secret key.

2. Encryption: Let $(m, b_m)$ be the plaintext in $Cl(\Delta_q)$ with
$\log_2 m < k$. Pick a random $(l - 1)$-bit integer $r$ and encrypt the
plaintext as follows using binary exponentiation and precomputation
techniques:

$(c, b_c) = (m, b_m)(h, b_h)^r$    (2)

Then $(c, b_c)$ is the ciphertext.

3. Decryption: Using the secret key $q$, we compute
$(d, b_d) = \pi((c, b_c))$. The plaintext is then $m = d$.

For a message embedding technique and the security aspects of this
cryptosystem, we refer to [18]. Again, please note that this cryptosystem
can easily be implemented using well-known techniques and existing smart
cards. This will be shown in the next section.

4 NICE Running Times in Software 

The prominent property of the proposed cryptosystem is the running time
of the decryption. Most prominent cryptosystems require decryption time
$O((\log_2 n)^3)$, where $n$ is the size of the public key. The total
running time of the decryption process of our cryptosystem is
$O((\log_2 \Delta_q)^2)$ bit operations. In order to demonstrate the
improved efficiency of our decryption, we implemented our scheme using
the LiDIA library [2]. It should be emphasized here that our
implementation was not optimized for cryptographic purposes; it is only
intended to provide a comparison between RSA and NICE. The results are
shown in table 1. In these tests, we chose $p \approx q$ so that breaking
RSA and NICE by factoring is approximately equally hard. Other variations
and a discussion of the variants can be found in [18].



Table 1. Average timings for the new cryptosystem with 80-bit encryption exponent
$r$ compared to RSA with encryption exponent $2^{16} + 1$ over 100 randomly chosen pairs
of primes of the specified size on a Pentium 266 MHz using the LiDIA library

$\log_2(\Delta)$               1024        1536        2048        3072
RSA encryption               2.2 ms      4.8 ms      7.5 ms     15.2 ms
RSA classical decryption   259.2 ms    751.3 ms   1643.9 ms   4975.6 ms
RSA decryption with CRT    110.5 ms    291.7 ms    629.7 ms   1855.5 ms
NICE encryption            602.9 ms   1180.1 ms   1902.0 ms   3933.5 ms
NICE decryption              3.8 ms      6.2 ms     10.0 ms     19.3 ms



Observe that one can separate the fast exponentiation step of the
encryption as a "precomputation" stage. Indeed, if we can securely store
the values $(\mathfrak{p}, b_{\mathfrak{p}})^r$, then the actual
encryption can be effected very rapidly, since it requires only one ideal
multiplication and one ideal reduction. Moreover, using well-known
techniques for randomized encryption, we can reduce the encryption time
even further. Note that no square root technique like the Pollard-rho
method or Shanks' algorithm is directly applicable to the ciphertext
$(c, b_c)$, because the encryption consists of
$(c, b_c) = (m, b_m)(\mathfrak{p}, b_{\mathfrak{p}})^r$, where $r$ is a
random exponent and $(m, b_m)$ is the secret plaintext. This means that
we can use a very short random exponent $r$ having e.g. about 80 bits.

It should be mentioned that the size of a message for our cryptosystem is 
significantly smaller than the size of a message for the RSA encryption (e.g. 
256 bit vs. 768 bit, or 341 bit vs. 1024 bit). In connection with the very fast 
decryption time, an excellent purpose for our cryptosystem could be (symmetric) 
key distribution. In that setting, the short message length is not a real drawback. 
On the other hand, the message length is longer than for ElGamal encryption 
on “comparably” secure elliptic curves (e.g. 341 bit vs. 180 bit). 



Table 2. Growth of the running time as the bit length of the public key
becomes larger (relative to the 1024-bit timing)

Public-key bit length       1024    1536    2048    3072
RSA encryption                 1    2.18    3.41    6.91
RSA classical decryption       1    2.90    6.34   19.20
RSA decryption with CRT        1    2.64    5.70   16.79
NICE encryption                1    1.96    3.15    6.52
NICE decryption                1    1.63    2.63    5.08






334 M. Hartmann, S. Paulus, and T. Takagi 



Note that even as the public key becomes large, the decryption time of
NICE grows more slowly than that of the RSA cryptosystem. This shows the
effectiveness of the quadratic decryption time of the NICE cryptosystem.
The ratios are given in table 2.
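The ratios in table 2 follow directly from the timings in table 1, and the same data give a rough empirical growth exponent. The short Python sketch below recomputes them; the names `growth` and `empirical_exponent` are hypothetical, and the exponent estimate (slope between the 1024-bit and 3072-bit timings on a log-log scale) is an added diagnostic, not a figure from the paper. For these key sizes lower-order terms still matter, so the estimates lie below the asymptotic exponents 2 and 3.

```python
import math

BITS = [1024, 1536, 2048, 3072]

# Timings in milliseconds, copied from table 1.
TIMINGS_MS = {
    "RSA encryption": [2.2, 4.8, 7.5, 15.2],
    "RSA classical decryption": [259.2, 751.3, 1643.9, 4975.6],
    "RSA decryption with CRT": [110.5, 291.7, 629.7, 1855.5],
    "NICE encryption": [602.9, 1180.1, 1902.0, 3933.5],
    "NICE decryption": [3.8, 6.2, 10.0, 19.3],
}


def growth(row):
    """Running time relative to the 1024-bit timing, rounded to two decimals."""
    return [round(t / row[0], 2) for t in row]


def empirical_exponent(row):
    """Slope e of time ~ bits^e between the smallest and largest key size."""
    return math.log(row[-1] / row[0]) / math.log(BITS[-1] / BITS[0])


for name, row in TIMINGS_MS.items():
    print(f"{name:26s} {growth(row)}  exponent ~ {empirical_exponent(row):.2f}")
```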



5 A Smart Card Implementation and Its Problems 

Moreover, we implemented the NICE decryption on a smart card. More
precisely, we implemented the NICE decryption algorithm using the Siemens
development kit for chip card controller ICs based on Keil PK51 for
Windows. As assembler we used A51 and as linker L51 to generate code for
the 8051 microcontroller family. The software simulations were made using
dScope-51 for Windows and the drivers for the SLE 66CX160S. Thus, we
realized a software emulation of an assembler implementation of the NICE
decryption algorithm to be run on the existing Siemens SLE 66CX160S.
Unfortunately, the timings of this software simulation were unrealistic.
So, we ran some timings on a hardware simulator for the SLE 66CX160S.
Thanks to Deutsche Telekom AG, Produktzentrum Telesec in Siegen and
Infineon/Siemens in Munich for letting us use their hardware simulator.
See the timings for decryption in table 3 for a smart card running at
4.915 MHz.



Table 3. Timings for the decryption of the new cryptosystem compared to RSA using
the hardware simulator of Siemens SLE 66CX160S at 4.915 MHz

Public-key bit length                    1024
RSA decryption with CRT                490 ms
New CS decryption for $p \approx q$   1242 ms
Improved version                      1035 ms



The very first implementation was very inefficient; the straightforward
algorithms used in the software comparison proved to be much slower on
the smart card than the existing RSA on the card. This was surprising,
but it is easily explained: the cryptographic coprocessor has been
optimized for modular exponentiation. On the other hand, NICE uses mostly
divisions with remainder and comparisons. These operations are slow on
the coprocessor, so we had to modify the decryption algorithm to speed it
up in hardware. We describe here two significant changes:

The computation of the inverse of $a$ modulo $q$ (step 2 in the
computation of $\pi$) using the extended Euclidean algorithm took (with
$q \approx p \approx 341$ bit) about 9 seconds (!), whereas computing the
inverse using Fermat's little theorem, by fast exponentiation mod $q$,
took less than 1 second. Note that the decryption time using this method
is no longer of quadratic complexity.
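The two ways of inverting $a$ modulo the prime $q$ can be compared in a few lines of Python. This is a sketch for illustration only: the function names are hypothetical, and the Mersenne prime $2^{127} - 1$ stands in for a real conductor. By Fermat's little theorem $a^{q-1} \equiv 1 \pmod q$ for $a$ not divisible by $q$, hence $a^{-1} \equiv a^{q-2} \pmod q$. The exponentiation costs on the order of $\log_2 q$ modular multiplications, which is why this variant is no longer of quadratic bit complexity, yet it uses exactly the operation an RSA coprocessor is built for.

```python
def inverse_euclid(a, q):
    """Inverse of a modulo q via the extended Euclidean algorithm."""
    r0, r1, t0, t1 = q, a % q, 0, 1     # invariant: r_i = t_i * a (mod q)
    while r1:
        k = r0 // r1
        r0, r1 = r1, r0 - k * r1
        t0, t1 = t1, t0 - k * t1
    if r0 != 1:
        raise ValueError("a is not invertible modulo q")
    return t0 % q


def inverse_fermat(a, q):
    """Inverse of a modulo a prime q via Fermat's little theorem."""
    return pow(a, q - 2, q)             # built-in modular exponentiation


q = 2**127 - 1                          # a known Mersenne prime (toy stand-in)
a = 123456789
print(inverse_euclid(a, q) == inverse_fermat(a, q))
```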

In the reduction process the quotient in the division with remainder step
is most of the time very small (say < 10, see Appendix A); to effect a
division in this case is much more time consuming than subsequent
subtractions. We replaced reduction step 5

5.1 Find $\lambda, \mu \in \mathbb{Z}$ s.t. $-A \le \mu = B - 2\lambda A < A$ using division with remainder
5.2 $(A, B, C) \leftarrow (C - \lambda (B + \mu)/2, -\mu, A)$

by the following algorithm.

5.1 WHILE $B \le -A$ OR $B > A$ DO
5.1.1 IF $B < 0$ THEN $(B, C) \leftarrow (B + 2A, C + (B + A))$
5.1.2 ELSE $(B, C) \leftarrow (B - 2A, C - (B - A))$

Every time that the bitlength of B was exceeding the bitlength of variant 
B by at least 3. Using this improvement, we could decrease the running
time from about 1.8 s to 1.2 s. This is already faster than a 1024 bit
RSA decryption without the Chinese remainder theorem (approx. 1.6 s).
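The trade-off between one division with remainder and a few subtractions can be made concrete with a small Python sketch. The function names `normalize_div` and `normalize_sub` are hypothetical, and the update of $C$ is derived here from the substitution $x \to x \pm y$ in the form $Ax^2 + Bxy + Cy^2$; it is an assumption about the intended update, not a transcription of the smart card code. Both variants keep the discriminant $B^2 - 4AC$ unchanged and differ only in the boundary case $B \equiv A \pmod{2A}$, where the division variant returns $-A$ and the subtraction variant returns $A$.

```python
def normalize_div(A, B, C):
    """One division with remainder: B = 2*A*lam + mu with -A <= mu < A."""
    lam = (B + A) // (2 * A)
    mu = B - 2 * A * lam
    return mu, C - lam * (B + mu) // 2


def normalize_sub(A, B, C):
    """Repeated addition/subtraction of 2*A until -A < B <= A."""
    while B <= -A or B > A:
        if B < 0:
            B, C = B + 2 * A, C + (B + A)   # right-hand side uses the old B
        else:
            B, C = B - 2 * A, C - (B - A)   # right-hand side uses the old B
    return B, C


A, B, C = 7, 45, 100                        # toy form of discriminant -775
print(normalize_div(A, B, C), normalize_sub(A, B, C))
```

With a quotient of 3, as in this toy example, the loop runs three times; for the small quotients observed in Appendix A this can be cheaper than a multi-precision division on a coprocessor optimized for modular exponentiation.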

The timings in table 3 include these two improvements. Moreover, a
detailed timing analysis in Siegen showed that both our static memory
management and the cryptographic coprocessor are not optimal for this
algorithm. We discuss this in the sequel. An overview of the average time
spent in the most time-consuming parts is given in table 4.



Table 4. Detailed timings for different functions in the decryption of the new
cryptosystem

Function                                          Average time over the whole computation
mul, multiplications on the coprocessor            170 ms
div, divisions on the coprocessor                  231 ms
left_adjust, length correction of the variables    376 ms
C2XL, moving numbers into the coprocessor          118 ms
XL2C, moving numbers out of the coprocessor        223 ms
others (comparisons, small operations)             114 ms
Overall time                                      1242 ms



One major difference between RSA and NICE is the number of variables
needed during the computation of the decryption algorithm. In our
implementation, we need to store 11 variables of length at most 2048 bit.
Computations of the cryptographic coprocessor shorten these variables.
Thus, we had to adjust the length of the variables after each important
operation. This was done by moving the top nonzero bytes of the number to
the fixed address of the number and so "erasing" leading zero bytes. To
do this, we used the cryptographic coprocessor. The exact timings showed
that about 33 % of the running time is spent in the function left_adjust,
which effects this correction.

Now changing the memory management from static to dynamic (i.e. making
the appropriate changes in the XL2C and C2XL functions and additionally
having some registers hold the starting addresses of the numbers), we got
an improvement to 637 ms using the software simulator. Both
Infineon/Siemens and Deutsche Telekom reported the overall time of the
hardware simulator now to be 1035 ms. Note that the amount of memory
required is 965 bytes and thus fits into a real SLE 66CX160S.

As one can see from table 4, another important time consuming operation 
is to move numbers into and out of the coprocessor. At this point, we would 
get a speedup of about 350 ms if we could leave the numbers in registers inside 
the coprocessor. It is clear that the currently used processor is not prepared for 
such operations, since it is optimized for RSA, thus operations with very few 
variables. At this point we ask the hardware community to present solutions to 
this problem. 

6 Conclusion and Acknowledgements 

The NICE cryptosystem is fast and well suited for software
implementation. To get an equally large speedup compared to RSA on a
smart card, we think that the underlying hardware must be developed
adequately. Nevertheless, if this is done, we think that NICE can be a
competitor to RSA whenever fast decryption is needed.

We thank Deutsche Telekom AG, Produktzentrum Telesec for letting us test
NICE on the hardware simulator and Siemens AG/Infineon GmbH for their
valuable help concerning the use of the development kit as well as for
running our code on their hardware simulator.
A An Example of the Reduction Step



Input: discriminant $\Delta_1$, an ideal $\mathfrak{A} = (A, B)$ before the reduction step

D1 = -3919553298811157368475523990994777486712879620160636686074980926686635774565281654416535613225312452663 
A = 1470919698723621747031034479033655458221345099558043935477761485226131553543707299391756504474531255189264719094757 

659808191301659419506155735441417414491 

B = -1612078661117333293617543382531888102723531401772217996101303796600583164587338163291108859457317212840954820324227 
661206969587778502329803762655721150213 



Reduction step 5.1: quotient $\lambda$, remainder $\mu$ such that $B = 2A\lambda + \mu$

(quotient, remainder) =

(-1, 132976073632991020044452557553542281371915879734386987485421917385167994250007643549240414949174529753757461786528 

7658409413015540336682507708227113678769) 

(-2, -127612435903271747284203940962563047288319779794024205168731345688046856992818597251036644668758126413199147109490 

159283556463934886789503896413069632105) 

(5, -785282900112624573884568439413292485034544440747434937937831117600068351530406923212045847152018597171039409030826 

1868661757907073551937411599271404065) 

(3, 102448189335108917873708106129511177370194888095368169144208661124365741713523215381199419560853590747848324434047 

4403297520916097275928966085178754669) 

(-2, -102241298455901475013823651184942662520218930922446297120610589090686808521973876143868942269542228647930772313523 

574165905118338106550708986943292605) 

(5, -111052964327862543248014454108681029972621611840532971461250154942477651873609548116068054690931838077804710928090 

93473562855601861999949481990650315) 

(2, 224721768750757491766093041444587045099908536747902569504729305667589920231890466102920134589953378731011273003071 

265369685205138129287433281611151) 

(-24, -193997682622386160747509256203459457001302400743643940157769391868127988127520988615996593055854386819648159017406 
4685369029295289498715467235503) 

(5, -872140247558161527297502021585254450313036153689333922978271792128743615045351839413078660868168505225167354418491
91048830504132818556443763927 )

( 5 , -658925615422042834876348480484029082058507054052681386209011509962105060677563908394232616059830404811010727301359 

3512069684208466784180774933 ) 

( 3 , -353725409650362059341413405754405045233330716598185927166581938071831594678230116338852933339930067189465564000769 

287233896579646995020989815 ) 

( 7 , -247222284294593727142489621540238650242932541449160633234337068881461071113337993384784406154327323546383475378203 

76474826892936909497877111 ) 

( 2 , 211238960823671342461903216289647740521689040736096428971091833115479874530277048960194279957267650917082000952992 

7566081981468920686945135 ) 

(- 5 , -138828499610837630682998254650680087729785399344104356860281267036927461979957572311938410081908330045415812611779 

611796816538719307405505 ) 

( 3 , -765811450988547423144046883303894668462310641251449515590759557414197782566477229459016888940284991233814769828608 

3138580548663993732859 ) 

( 6 , 451715975973337636239923655622704513018369560900086413869204962835152615559671858853242965579677357251079839392053 

844630453417984392931 ) 

(- 3 , 579493992911536599517350321866083438435686742562973463372479826459392480226779856939226160460117478967847924671753 

92968596489710481213 ) 

(- 3 , 135067752525885105483994559664721938559177175680541689433450130455304919692737374065720006258404545809371465593318 

9049738778828983289 ) 

(- 15 , 337193319312295431745969016664643151387915479765592068235657385273215560443110982269888116148705401309431343138225 

96913769977980851 ) 

(- 3 , 323878473831343845968960745512074363754506782760808676681774080723182382322942699332100336805150642284551592326904 

4296100266373941 ) 

(- 4 , 167144197381464431967528225738718542314051660172099527690831373123590316239808571782330789831021939891807455958096 

732542876200243 ) 

(- 5 , -309390011597559593159761073745581970668200194635499292322116436596067679796830203942404937063400931822640702431265 

3906939082883 ) 

( 11 , -115310685805821673604505436475263765545172934158683202462769686123589664701441798228728715241781657550421817808920 

121862424193 ) 

( 3 , -214164778849294341813886319240237413612645988098168730763311338303186025963320880697339330474638842906957278383121 

52170109453 ) 

( 2 , 128885235911603341061379124399416424159190535507373563424716539280561837794601490501570150868125552599247818531014 

6905542061 ) 

(- 8 , 316335781594392130649537991526978632041159226767264341022335809622254943347009615572076976512856499763471626857986 

37456771 ) 

(- 5 , -132100188970518808134989273741019953395958169925083468958496091548029282207578810138129189350877886346393515471619 

9449891 ) 

( 5 , -118207678855173060273461976677047511666107495356238233325850746777448433952887041962223444468735121774461329878842 

257829 ) 

( 2 , 211192794617132775962070178935022665383131433119630308575044151078117101318282538223456317584240935159085658194389 

75697 ) 

(- 2 , -274328530511417242554111054080256110572002014193891842660039774575578405921937962321580147605536058202955477454722 

6713 ) 

( 3 , 286074815889198540147442971313659278227472798937810428709777362705663473916998086767681817465542781829132249359517 

975 ) 

(- 3 , 136753299791205246544364486899154151347754882077985198546727042251978367689138220757139643994263442612011028174033 

09 ) 

(- 7 , - 573392565819784089015761647765609265475213202129944067089638531361517839793128894759675411629999386769288097157573 ) 

( 3 , 46420141247202899632758735492428308491655665018737859648769488246653936727190431310329578844089903814145126014811 ) 

(- 4 , 2648752977947845396817832353867014934646503864836612394943244361752704526516616394918528493321394574264607485573 ) 

(- 5 , 210855924586385522702291996875375564432834909148788263313616464995823236571191045770185801157558886696640500937 ) 

(- 3 , 22358943854775137747028532971526452088757418214703795026514539955938354527714495940754319443379591091154854811 ) 

(- 3 , - 3066428918730050632325097660734163737851572230223614656315369265785635066318979992409078183621174963776878039 ) 

( 2 , 142086598987407275566528223029847683110629990793367193207487961621845231249474450762716878333452768170371307 ) 

(- 10 , - 4013833186275051054261529002448266001854462865736283000730209262386311300176819714480584897953649188768987 ) 

( 3 , 513318782779814540099933414852657264974713493469374585453379543647835185456621560862523908562876553217709 ) 

(- 2 , - 61677436882563203815369113178313444194666063177138061203361907960930109397743003273918060562207208817421 ) 

( 4 , - 5705343812063530411004779368065864956629980006638652578479437486915684727604688458627505624819542303243 ) 

( 3 , - 91558187817393163469667420478915357008167385723230297957370355268833319053370140794607127532476127053 ) 

( 21 , 453868533685069718784249965367088878135788567086616180536047830052899711042568402989974727201843537 ) 

(- 10 , 20964009631208954520918551253653792022251672637669648478775352234648622483831508661058580485766883 ) 

(- 2 , - 2452654164796263512959165565377597147123165793392501354290866811059027722610119847271723828796771 ) 

( 4 , - 147052793673461538086506530668357561094222647824631587108653738299104036416816746499164729738165 ) 

( 4 , 13963798812911120130279709463801187300299288356253613055092055278693493066888731286397000735637 ) 

(- 2 , - 2243059469029860341733582498093820347780557194768743665298129222007849423064256685446565258585 ) 

( 3 , - 332536984737070293356781633953414109182889100380596552532026039092003149663826893067835527979 ) 

( 3 , - 53869672449708954071201037343980172216194792353383517263934399693845855469016876949250112145 ) 

( 2 , 8809243662589899724441862302427489374063110417101583652380728682887905485821254651162082773 ) 

(- 3 , 1523916850396224929683411116520822483473365693967880645688939842558533917993725763450387843 ) 

(- 2 , - 175448936439219351474943190408140984880152096488165074286001873576813341763276791632528051 ) 

( 4 , - 7172119525392604758586912757900411673673727956088230007355764602327785267343320520772285 ) 

( 6 , 411983586217791325733798025629672849513780286785778486048473044819412478399711749331293 ) 

(- 3 , 39952378254046331640985245359853148480439674672046852693477940886873371190791503779555 ) 

(- 4 , 2430418964222694480671212916136321892771426164807894584775183594905868264571397226917 ) 

(- 4 , - 200481638950679212651370696237549969836787003530772668084652658004782296755431254661 ) 

( 3 , - 15809036670606058719969662268090458772980604046280756287155668401840558844802721257 ) 

( 5 , - 1523585718315717333817603063481778938222355282742962613553158680390832265726743003 ) 

( 2 , 184311493695616947777956105808291545959771200085283101706038290472860989663874931 ) 

(- 4 , 18608721351129642487382805128934749545584400507726296787447608402705855634753005 ) 

(- 3 , 1869347445173273085208023229014781878050128548111311960144791632389450110085991 ) 

(- 4 , 178380730096884354033504909171901616894208529518268605707305999380528892537777 ) 

(- 3 , 8087485289366444463103610663066672869280303202819888635266439607722241722651 ) 

(- 8 , 330989870518604330321501498261660421116420485572033339979389384876143961861 ) 

(- 3 , - 18664536615969020483601618613669413560772978432066084296155744931355960393 ) 

( 6 , - 1412504875497260093376398626643530047287435788085760178403662982434702419 ) 

( 2 , 219996492734844884145785372293460505348942755871479135939553072964463795 ) 

(- 3 , 23516043605665539210119254241067982862993090461007215280854551535870327 ) 

(- 3 , - 3077511624656829785194435187496372328800530833832937512976380187853603 ) 

( 2 , 297152112917428186473487402530178004938361047882953462356887529974435 ) 

(- 5 , 20430479687841138388651786490336214660319513581933064107387538439725 ) 
(-3, -715710522544461579764349638671583037687849172262961920648424641623)

(9, 14179513470875975149484884956525606261632898759917041924304692531) 

(-5, -1282525333271264792977028780361879602218745339300046465268399291) 

(2, 7133580613529858737331990335086812687125553016512866750439935) 

(-89, -31402743133518053709238680774128360213498381554795257802491) 

(3, -5670014347280097827799124543796297689457759679846318408801) 

(2, 466884359172087417048784300784530655902510368684415610549) 

(-6, 35846124997992718851920200094845005686531518229928579835) 

(-2, -5174946979118686463949165359305774723028412521870150151) 

(3, -63863243252949591186892355477373614132946424509565125) 

(27, 742593578199047322036735798245427373773251928151497) 

(0, -742593578199047322036735798245427373773251928151497) 



Output: the reduced ideal equivalent to $\mathfrak{A}$

A1 = 956239841432722652133553576334329186177525820778149 
B1 = -742593578199047322036735798245427373773251928151497 
Abstract. Most of the data transmission networks used today are based
on the technology of the Synchronous Digital Hierarchy (SDH) or
Synchronous Optical Networks (SONET), respectively. However, they rarely
support any security services for confidentiality, data integrity,
authentication or any protection against unauthorized access to the
transmitted information. It is the subscriber's responsibility to apply
security measures to the data before the information is passed on to the
network. The use of encryption provides data confidentiality. This,
however, requires consideration of the underlying network technology. The
method described in this paper allows the use of encryption in broadband
networks. The advantages of this method are the transparency of the
encryption applied to the signal structure and signal format, and the
automatic resynchronization after transmission errors. The mode of
operation used is called "statistical self-synchronization", because the
synchronization between encryption and decryption is initiated by the
presence of a certain bit pattern in the ciphertext, which occurs
statistically. An encryption device designed for SDH/SONET networks with
transmission rates of 622 Mbit/s is presented.

Keywords: Broadband Networks, SDH/SONET, Confidentiality, Cryptography,
Encryption, Modes of Operation, Self-Synchronization



1 Introduction 

The increased demand for bandwidth over the last years has prompted the
development of efficient digital transmission systems. Modern transmission
systems provide higher transmission rates with a better bandwidth/cost
ratio. SDH and SONET are technologies that not only fulfill this demand,
but also offer enhanced network management, controllable quality of
service, and a simple multiplex structure. They are based on common
international standards. One specific property of these networks is that
one central clock is supplied to all network components, such as
multiplexers or cross-connects. These networks are therefore also called
synchronous networks.



C.K. Koc and C. Paar (Eds.): CHES’99, LNCS 1717, pp. 340-352, 1999. 
The whole range of information transmission, e.g. from phone calls to video
transmission, is handled via this type of network. SDH/SONET is also widely
used for corporate networks, which span long distances that cannot be
controlled. Encryption technology is therefore especially important in this
setting. Four main factors make it difficult to realize an encryption
technique for synchronous broadband networks:

- The data transmission rate of 622 Mbit/s at maximum and 155 Mbit/s at
  minimum, with 2.4 Gigabit/s or 10 Gigabit/s forecast for the future

- Synchronous processing

- Bit slipping

- Complex management information

The synchronous transmission in these networks requires that encipherment
is performed at the same rate as the network transmission rate. In Europe,
no VLSI encryption components supporting such high throughput rates are
available today. Therefore, it is necessary to use multiple encryption
chips in parallel. The standardized and fixed frame structure leaves no
room for additional synchronization information for the crypto algorithm.
We use the technique of statistical self-synchronization, which allows the
decryption algorithm to synchronize even in the case of bit slipping. This
mode of operation guarantees that the correct plaintext is computed at the
receiver’s side after an error propagation has occurred. Bit slipping can
occur if bits are deleted or additional bits are added by transmission
components due to small differences between the data rates of the
transmission lines (jitter).

Bit slipping is, however, absorbed to a certain extent by the synchronous
network components: up to three bytes can be positively or negatively
stuffed into one transmission frame. Bit or byte slipping therefore happens
only if these thresholds are exceeded. This error needs to be taken into
consideration even if the probability of bit or byte slipping is extremely
small, because automatic rerouting after line drops can cause it as well.

The management information contained in the transmission frame requires
that specific parts of the frame are not encrypted: information that is
important for network management and is processed by the network components
has to stay in plaintext. The management information must therefore bypass
the encryption.

Section 2 briefly describes the structure of the STM-1 frame, whose payload
is to be encrypted. Section 3 focuses on the modes of operation of block
ciphers and shows that the standardized CFB and OFB modes are not sufficient
for use in synchronous broadband networks. A new mode of operation, which we
call statistical self-synchronization, is presented in Section 4. Section 5
outlines the realization of an encryption device for 622 Mbit/s with an
STM-4 or STS-12c interface, respectively, and gives additional information
on selected implementation aspects. The paper concludes with a summary and
an outlook in Section 6.
2 SDH and SONET

SDH and SONET are transmission systems whose hierarchy of transmission
rates is open-ended and that were originally developed for use in Wide Area
Networks (WANs). Nowadays, however, they are also used for Asynchronous
Transfer Mode (ATM) in Local Area Networks (LANs). The standards for SDH
and SONET contain not only the definitions of interfaces, e.g. transmission
rates, formats and multiplexing techniques, but also recommendations for
network management. SDH and SONET have similar characteristics. SONET,
however, is mainly used in North America and is based on a standard frame,
called STS-1, with a transmission rate of 51.84 Mbit/s, whereas SDH, based
on a standard rate of 155.52 Mbit/s, is widespread in Europe and Asia. The
standard SDH frame, called STM-1, contains three concatenated STS-1 frames.
Both transmission methods offer the basis for international data
transmission and both support interfaces for existing as well as future
techniques. The SDH network interface is specified in ITU-T G.707 [5].




Fig. 1. SDH-Multiplex Hierarchy



Figure 1 shows how the Synchronous Transport Module 1 (STM-1) is
constructed. The signals of the tributary systems that use SDH as a
transport network are mapped into a standardized container (C-x). By adding
stuffing bits and control information, the so-called Path OverHead (POH),
we get a Virtual Container (VC-x). The STM-1 module is constructed by
adding a pointer, which points to the first byte of the Virtual Container,
and by adding the Section OverHead (SOH), which consists of the Regenerator
Section OverHead (RSOH) and the Multiplexer Section OverHead (MSOH) (see
Figure 2).

SDH is built in a modular way, with the STM-1 forming the basis for all
higher transmission rates. Higher transmission rates are obtained by
byte-wise multiplexing of 4·n STM-1 frames (n = 1, 2, etc.) into one
STM-4·n frame. In this way, the next level of the hierarchy is the STM-4,
which offers a capacity of 622 Mbit/s. The same structure is also defined
in SONET, but under different names: the STM-1 corresponds to a SONET
STS-3c frame and the STM-4 to an STS-12c frame.
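The rates quoted in this section can be cross-checked with a few lines of arithmetic. The following Python sketch assumes the STM-1 frame layout of the SDH standard (9 rows of 270 bytes, 8000 frames per second), which is not spelled out in the text; the function name is illustrative only.

```python
# Assumed STM-1 frame layout (from the SDH standard, not stated in the text):
# 9 rows x 270 byte columns, one frame every 125 microseconds.
ROWS = 9
COLUMNS = 270
FRAMES_PER_SECOND = 8000

STS1_RATE_MBIT = 51.84  # SONET base rate quoted in the text


def stm_rate_mbit(n: int) -> float:
    """Line rate in Mbit/s of an STM-n signal made of n byte-interleaved STM-1 frames."""
    bits_per_frame = ROWS * COLUMNS * 8 * n
    return bits_per_frame * FRAMES_PER_SECOND / 1e6


if __name__ == "__main__":
    print(f"STM-1:     {stm_rate_mbit(1):.2f} Mbit/s")
    print(f"3 x STS-1: {3 * STS1_RATE_MBIT:.2f} Mbit/s")
    print(f"STM-4:     {stm_rate_mbit(4):.2f} Mbit/s")
```

Under these assumptions the STM-1 rate equals three times the STS-1 rate, and the STM-4 rate is four times the STM-1 rate, which is the 622 Mbit/s figure used throughout the paper.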

The encryption device described in this paper processes STM-4 frames. The 
622 Mbit/s data stream is internally split into four 155 Mbit/s streams, which 
are passed over to the encryption modules byte by byte. 
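The byte-wise split into four streams can be illustrated as follows. This is a minimal Python sketch that assumes a perfectly frame-aligned byte stream and ignores the Section Overhead processing; the function names are hypothetical and not taken from the device.

```python
def demux_bytewise(stm4: bytes, lanes: int = 4) -> list[bytes]:
    """Split a byte-interleaved stream into `lanes` tributary streams."""
    return [stm4[i::lanes] for i in range(lanes)]


def mux_bytewise(streams: list[bytes]) -> bytes:
    """Inverse of demux_bytewise: interleave the tributary streams byte by byte."""
    out = bytearray(sum(len(s) for s in streams))
    for i, stream in enumerate(streams):
        out[i::len(streams)] = stream
    return bytes(out)


if __name__ == "__main__":
    data = bytes(range(16))
    lanes = demux_bytewise(data)
    print([lane.hex() for lane in lanes])
    print(mux_bytewise(lanes) == data)
```

Each of the four resulting streams then runs at a quarter of the line rate, so that one encryption module per stream only has to keep up with 155 Mbit/s.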







SOH and POH contain bytes for frame synchronization, signaling of the frame
structure, service quality monitoring, path identification, alerts and alert
responses. Some overhead bytes bypass the encryption and enter the next SOH
or POH, respectively, to be transmitted in plaintext. Others have to be
recalculated, e.g. parity bytes, which check the integrity of the payload.
Still others have to be encrypted, as they contain stuffed user information
(stuffing). Consequently, the encryption modules must be told which bytes
are not to be enciphered.

3 Encryption 

Encryption algorithms can be split into two groups: stream ciphers and
block ciphers. Block ciphers encrypt blocks of bits in one step; a block
length of 64 bits is usual. In contrast, stream ciphers encrypt only single
bits or bytes at a time. It would be beneficial to use stream ciphers for
the encryption of data units transmitted in broadband networks, as they
reduce the delays in encryption devices. However, few cryptographically
secure stream ciphers are known, and the corresponding VLSI chips, which
are needed in any case for high data rates, are not available. On the other
hand, each block cipher can be turned into a stream cipher if it is used in
an appropriate mode of operation. This approach is chosen here.

The four modes of operation defined so far in ISO 10116 [4] differ
considerably in their properties regarding security, synchronization, error
propagation, delay and throughput. (Note: We expect that the next release
of this standard will be extended by new modes appropriate for high-speed
applications, e.g. the ATM Counter Mode.)

In order to turn block ciphers into stream ciphers, they are used as key
stream generators. Two modes exist for this: the Cipher FeedBack Mode (CFB)
is used for self-synchronizing stream ciphers, and the Output FeedBack Mode
(OFB) is used for synchronous stream ciphers.



3.1 The CFB-Mode 

Encryption in the CFB-mode is achieved by XOR-ing the plaintext with the
output of a key stream generator. The key stream is generated by the block
cipher under a secret key. The input data of the algorithm is buffered in
an input shift register. Since the last revision of ISO 10116, this input
shift register can also be larger than the block length of the block cipher
in order to enable parallel encryption units for the high-speed generation
of the key stream. In the standard case, n bits of the ciphertext are fed
back into the input shift register if n bits of the generated key stream
are used for the encryption of n bits of the plaintext. Adjustments of the
word formats may require stuffing of the fed-back ciphertext. This stuffing
is explained in detail in ISO 10116. The definition of the CFB-mode in
ISO 10116 is therefore somewhat more complicated than shown in Figure 3.

The CFB-mode offers the major benefit of self-synchronization. If a
synchronization error occurs because a ciphertext unit of n bits is erased
or added (corresponding to an n-bit slip), the decrypting side generates
wrong plaintext only until the defective ciphertext units have been shifted
out of the input shift register. The same behavior occurs if bits have been
modified during transmission.
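The self-synchronizing behavior of the CFB-mode can be reproduced with a small Python sketch. It makes several assumptions that are not part of the paper: the feedback unit is n = 8 bits (so it recovers from byte slips, not single-bit slips), the stuffing rules of ISO 10116 are ignored, and a keyed hash from the Python standard library stands in for the 64-bit block cipher. The names `toy_block_encrypt` and `cfb8` are hypothetical.

```python
import hashlib

BLOCK_BYTES = 8  # 64-bit block length


def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Keyed 64-bit function standing in for the block cipher (this is not DES)."""
    return hashlib.blake2b(block, digest_size=BLOCK_BYTES, key=key).digest()


def cfb8(key: bytes, start_value: bytes, data: bytes, decrypt: bool = False) -> bytes:
    """CFB with n = 8: one block cipher call yields one key stream byte."""
    register = start_value
    out = bytearray()
    for byte in data:
        keystream_byte = toy_block_encrypt(key, register)[0]
        result = byte ^ keystream_byte
        cipher_byte = byte if decrypt else result
        # Feed the ciphertext byte back into the input shift register.
        register = register[1:] + bytes([cipher_byte])
        out.append(result)
    return bytes(out)


if __name__ == "__main__":
    key, start = b"demo key", bytes(BLOCK_BYTES)
    plaintext = bytes(range(64))
    ciphertext = cfb8(key, start, plaintext)
    slipped = ciphertext[:10] + ciphertext[11:]  # one ciphertext byte is lost
    recovered = cfb8(key, start, slipped, decrypt=True)
    # After 8 further bytes the damaged unit has left the shift register.
    print(recovered[18:] == plaintext[19:])
```

In this sketch the receiver produces wrong plaintext for exactly the eight bytes that follow the lost byte; afterwards its shift register holds only correct ciphertext again.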

It turns out, however, that the implementation of the CFB-mode requires a
very high encryption throughput rate. If the plaintext is enciphered n bits
at a time, a complete input block needs to be encrypted in order to obtain
n bits of ciphertext. If V is the throughput rate of the block cipher
implementation, the effective rate at which plaintext can be encrypted in
CFB-mode is:

$$V \cdot \frac{n}{64}$$
Fig. 3. CFB-Mode



If n = 1 is selected in order to obtain self-synchronization even in the
case of bit slipping, then an encryption capacity of approximately
40 Gigabit/s per transmission direction is required to encrypt an STM-4
interface of 622 Mbit/s with a payload of approximately 600 Mbit/s. The
CFB-mode therefore cannot be used in this way for broadband networks.
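The figure of roughly 40 Gigabit/s follows directly from the relation between block cipher throughput and effective CFB rate for a 64-bit block. A short Python sketch of this arithmetic, with illustrative function names:

```python
BLOCK_BITS = 64


def effective_cfb_rate(cipher_rate_bps: float, n: int) -> float:
    """Plaintext rate in CFB-mode when n bits are used per block cipher call."""
    if not 1 <= n <= BLOCK_BITS:
        raise ValueError("n must be between 1 and the block length")
    return cipher_rate_bps * n / BLOCK_BITS


def required_cipher_rate(data_rate_bps: float, n: int) -> float:
    """Block cipher throughput needed to encrypt data_rate_bps in CFB-mode."""
    if not 1 <= n <= BLOCK_BITS:
        raise ValueError("n must be between 1 and the block length")
    return data_rate_bps * BLOCK_BITS / n


if __name__ == "__main__":
    # n = 1 on the 622 Mbit/s line rate and on the approx. 600 Mbit/s payload.
    print(f"{required_cipher_rate(622.08e6, 1) / 1e9:.1f} Gbit/s")
    print(f"{required_cipher_rate(600e6, 1) / 1e9:.1f} Gbit/s")
```

Conversely, a block cipher implementation running at 160 Mbit/s would deliver only 2.5 Mbit/s of plaintext throughput in CFB-mode with n = 1.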



3.2 The OFB-Mode 

The OFB-mode, in contrast to the CFB-mode, does not feed the ciphertext
back into the input shift register, but the generated key stream. In this
way, a complete output block of the key stream can be XORed with the
plaintext for encryption, even if this is done n bits at a time (see
Figure 4). The effective encryption rate therefore equals the encryption
rate V of the key stream generator. A simplified description has been
chosen once again, because the feeding back of smaller units as well as
adjustments of word formats are considered in more detail in ISO 10116.

The OFB-mode offers the benefit of a high data throughput, but no
self-synchronization. This type is therefore also called a synchronous
stream cipher. Because the transmitted ciphertext is not used for the
generation of the key stream, the cryptographic synchronization is
completely lost after a synchronization error and cannot be recovered. On
the other hand, no error propagation happens if bits have been modified
during transmission.
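Both properties of the OFB-mode, no error propagation but permanent loss of synchronization after a slip, can be observed in the following Python sketch. As an assumption for illustration, a keyed hash from the standard library replaces the 64-bit block cipher, and complete output blocks are fed back; the function names are hypothetical.

```python
import hashlib

BLOCK_BYTES = 8  # 64-bit block length


def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Keyed 64-bit function standing in for the block cipher (this is not DES)."""
    return hashlib.blake2b(block, digest_size=BLOCK_BYTES, key=key).digest()


def ofb_keystream(key: bytes, start_value: bytes, length: int) -> bytes:
    """Generate key stream by feeding each output block back as the next input."""
    stream = bytearray()
    block = start_value
    while len(stream) < length:
        block = toy_block_encrypt(key, block)
        stream += block
    return bytes(stream[:length])


def ofb_xor(key: bytes, start_value: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt: in OFB both directions are the same XOR."""
    keystream = ofb_keystream(key, start_value, len(data))
    return bytes(d ^ k for d, k in zip(data, keystream))


if __name__ == "__main__":
    key, start = b"demo key", bytes(BLOCK_BYTES)
    plaintext = bytes(range(64))
    ciphertext = ofb_xor(key, start, plaintext)
    flipped = ciphertext[:10] + bytes([ciphertext[10] ^ 1]) + ciphertext[11:]
    damaged = ofb_xor(key, start, flipped)
    print(sum(a != b for a, b in zip(damaged, plaintext)))  # errors stay local
```

If one ciphertext byte is removed instead of modified, everything after the slip position decrypts to garbage, because the receiver's key stream never realigns with the ciphertext.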
Fig. 4. OFB-Mode



4 The Statistical Self-Synchronization 

The two described stream cipher modes of operation of block ciphers differ
greatly in their properties. The CFB-mode is self-synchronizing, but offers
only a low data throughput and exhibits error propagation. The OFB-mode, in
contrast, is not self-synchronizing, but shows no error propagation and
offers a higher encryption rate.

The optimal solution would therefore be to combine the properties of both
modes of operation. This is achieved by a new mode of operation, which we
call statistical self-synchronization.

The statistical self-synchronization switches from one mode of operation to
the other and back. Synchronization between encryption and decryption is
reached by using the CFB-mode; the OFB-mode is used between the
synchronization phases. Loss of synchronization occurs in case of bit or
byte slipping. In order to resynchronize, both sides need to be switched to
CFB-mode. Encryption and decryption are kept in CFB-mode until the input
shift registers are filled with a complete block of ciphertext, which has
to be identical on both sides. This content is used as a new starting
value, and the OFB-mode is used again afterwards (see Figure 5).

Fig. 5. Statistical Self-Synchronization

The decryption side, however, cannot recognize when the synchronization
has been lost. As there is no additional communication capacity between the
encryption and decryption entities to signal a switch of modes, both sides
search for a fixed bit pattern in the ciphertext. This bit pattern occurs
in a statistically distributed way in the ciphertext. Once the pattern is
found, both sides switch to CFB-mode. The length of the bit pattern defines
the probability of the synchronization and needs to be chosen in relation
to the probability of bit slipping. The content of the bit pattern can be
selected arbitrarily, as all bit patterns of a fixed length are equally
probable in the ciphertext. A bit slip causes a loss of synchronization,
because the OFB-mode is used between the synchronization phases. Encryption
and decryption then remain out of synchronization until the bit pattern
occurs in the ciphertext. On the other hand, a switch into CFB-mode also
takes place when no synchronization loss has occurred. This is the reason
why we call this mode of operation "statistical self-synchronization".
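The relation between pattern length and synchronization frequency can be estimated with simple arithmetic. The Python sketch below assumes that the ciphertext bits are uniformly distributed and independent, so that a fixed pattern of k bits matches at a given bit position with probability 2^-k, i.e. about once per 2^k bit positions on average. The pattern lengths used in the example are hypothetical; the paper does not state the length chosen for the device.

```python
def pattern_probability(pattern_bits: int) -> float:
    """Probability that a fixed pattern ends at a given ciphertext bit position."""
    return 2.0 ** -pattern_bits


def mean_sync_interval_bits(pattern_bits: int) -> int:
    """Average number of bit positions per pattern match (idealized ciphertext)."""
    return 2 ** pattern_bits


def mean_sync_interval_seconds(pattern_bits: int, line_rate_bps: float) -> float:
    """Average time between pattern matches at the given line rate."""
    return mean_sync_interval_bits(pattern_bits) / line_rate_bps


if __name__ == "__main__":
    for bits in (16, 20, 24):  # hypothetical pattern lengths
        seconds = mean_sync_interval_seconds(bits, 155.52e6)
        print(f"{bits}-bit pattern: about every {seconds * 1e3:.2f} ms at 155.52 Mbit/s")
```

A shorter pattern shortens the time during which the link stays out of synchronization after a slip, at the price of spending a larger share of the time in the slower synchronization phase. The estimate ignores the period during which pattern recognition is switched off.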

It should be emphasized again that the bit pattern is generated by the
encryption process itself, as a result of the encryption of the plaintext.
No additional bandwidth is necessary to signal the start of a
synchronization or resynchronization.

Switching to the slower CFB-mode implies for the encryption that, during
operation in OFB-mode, as many key stream blocks need to be stored in the
output buffer as are necessary to encrypt the plaintext during the next
synchronization phase. Therefore, the encryption rate in OFB-mode must be
higher than the transmission rate.

During a synchronization phase, another synchronization must not be
initiated. The bit pattern recognition is therefore switched off during the
synchronization process.
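The interplay of the rules in this section can be sketched as a bit-serial simulation in Python. This is an illustrative model under stated assumptions, not the device's implementation: a keyed hash replaces the 64-bit block cipher, timing and buffering are ignored, and the 8-bit pattern in the usage example is far shorter than a real link would use. The class and function names are hypothetical.

```python
import hashlib

BLOCK_BITS = 64
BLOCK_MASK = (1 << BLOCK_BITS) - 1


def toy_block_encrypt(key: bytes, block: int) -> int:
    """Keyed 64-bit function standing in for the block cipher (this is not DES)."""
    data = block.to_bytes(BLOCK_BITS // 8, "big")
    digest = hashlib.blake2b(data, digest_size=BLOCK_BITS // 8, key=key).digest()
    return int.from_bytes(digest, "big")


class StatisticalSelfSync:
    """Sender and receiver run this same state machine on the ciphertext."""

    def __init__(self, key: bytes, start_value: int, pattern: int, pattern_bits: int):
        self.key = key
        self.state = start_value          # OFB feedback value
        self.keystream = []               # key stream bits not yet used
        self.pattern = pattern
        self.pattern_bits = pattern_bits
        self.pattern_mask = (1 << pattern_bits) - 1
        self.window = 0                   # most recent ciphertext bits
        self.window_fill = 0
        self.to_collect = 0               # > 0 during a synchronization phase
        self.register = 0                 # input shift register

    def _next_keystream_bit(self) -> int:
        if not self.keystream:
            # OFB: the previous output block is the next cipher input.
            self.state = toy_block_encrypt(self.key, self.state)
            self.keystream = [(self.state >> i) & 1 for i in reversed(range(BLOCK_BITS))]
        return self.keystream.pop(0)

    def _observe(self, cipher_bit: int) -> None:
        if self.to_collect:
            # Synchronization phase: collect ciphertext, pattern search is off.
            self.register = ((self.register << 1) | cipher_bit) & BLOCK_MASK
            self.to_collect -= 1
            if self.to_collect == 0:
                # The collected block is the new starting value; its encryption
                # is the first key stream block after the phase.
                self.state = self.register
                self.keystream = []
                self.window = 0
                self.window_fill = 0
            return
        self.window = ((self.window << 1) | cipher_bit) & self.pattern_mask
        self.window_fill = min(self.window_fill + 1, self.pattern_bits)
        if self.window_fill == self.pattern_bits and self.window == self.pattern:
            self.to_collect = BLOCK_BITS
            self.register = 0

    def encrypt_bit(self, plain_bit: int) -> int:
        cipher_bit = plain_bit ^ self._next_keystream_bit()
        self._observe(cipher_bit)
        return cipher_bit

    def decrypt_bit(self, cipher_bit: int) -> int:
        plain_bit = cipher_bit ^ self._next_keystream_bit()
        self._observe(cipher_bit)
        return plain_bit
```

A possible use: create two instances with the same key, starting value and pattern, encrypt a bit sequence with one, delete a single ciphertext bit, and decrypt with the other. The receiver outputs wrong plaintext from the slip position until both sides have seen the pattern and collected the same 64 ciphertext bits, after which decryption is correct again.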

5 Implementation 

Figure 6 shows the design of the encryption device. There are two essential
components: the line interface and the encryption/decryption module.

It is the task of the line interface to convert the 622 Mbit/s data stream
into a format that can be handled by the encryption/decryption module.
During this process, plaintext and ciphertext are processed separately,
i.e. different components are used (red-black separation). Otherwise,
plaintext could end up in the encrypted data stream due to malfunctioning
interface components or crosstalk.

A direct connection between plaintext and ciphertext side exists only for the 
overhead, for different network alerts and for the high frequency reference clock, 
which are bypassed transparently. 

An additional signal is passed together with the data stream from the line
interface to the encryption/decryption module in order to indicate which
bytes are not to be encrypted/decrypted because these positions contain
overhead information.
Fig. 6. Design of the Encryption Device



The SDH frame structure supported by the encryption device is shown in bold
in Figure 1. It is a VC-4 that has been mapped into one STM-1 frame. One
STM-4 frame consists of four STM-1 frames. This structure has been selected
because it is the most flexible one: all tributary signals are transmitted
in the VC-4. It is also used in ATM networks that use SDH on the physical
layer. The path via VC-3 and AU-3 is not very common in Europe and only
serves as an adaptation to SONET.

5.1 Implementation of the Line Interface 

The line interface has an optical input and output, as electrical
transmission is hardly possible at such high bit rates. The input signal is
converted to an electrical signal and parallelized. This is followed by a
demultiplexer, which splits the STM-4 frame into four STM-1 frames and
processes the STM-4 Section Overhead. The SOH bypasses the encryption and
is reassembled at the transmitter side. Certain bits have to be
recalculated at the transmitter side. For each byte position of the SOH, it
is decided whether the received byte or the recalculated byte is forwarded
to the transmitted SOH (see Figure 7).

The four STM-1 frames are passed on bytewise to the encryption or
decryption. The high-frequency part is designed in ECL technology, whereas
TTL technology is used for the remaining part (approx. 20 MHz).



Fig. 7. Implementation of the 622 Mbit/s Line Interface



No encrypted data is to be placed at byte positions that are used for overhead 
(SOH and POH). Encryption is therefore switched off during the presence of 
overhead data, i.e. overhead data passes the encryption in plaintext. The overhead 
bytes are recognized by analysis of the frame signal, evaluation of the pointer, 
and the counting of the byte positions. 
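The principle of switching the encryption off at overhead positions can be sketched in Python. The sketch is deliberately simplified and rests on assumptions not taken from the paper: it uses the STM-1 layout of the SDH standard (9 rows of 270 bytes, the first 9 columns of each row carrying SOH and pointer bytes), treats everything else as payload, and ignores the POH column and the pointer evaluation described above. The function names are hypothetical.

```python
ROWS = 9
COLUMNS = 270
OVERHEAD_COLUMNS = 9          # SOH and pointer columns at the start of each row
FRAME_BYTES = ROWS * COLUMNS
PAYLOAD_BYTES = ROWS * (COLUMNS - OVERHEAD_COLUMNS)


def is_overhead(byte_index: int) -> bool:
    """True if the byte position (counted row by row) lies in an overhead column."""
    return byte_index % COLUMNS < OVERHEAD_COLUMNS


def xor_payload(frame: bytes, keystream: bytes) -> bytes:
    """XOR key stream onto payload bytes only; overhead bytes pass in plaintext."""
    if len(frame) != FRAME_BYTES:
        raise ValueError("expected exactly one STM-1 frame")
    if len(keystream) < PAYLOAD_BYTES:
        raise ValueError("not enough key stream for one frame")
    stream = iter(keystream)
    # Key stream is consumed only at payload positions.
    return bytes(b if is_overhead(i) else b ^ next(stream) for i, b in enumerate(frame))


if __name__ == "__main__":
    frame = bytes(i % 256 for i in range(FRAME_BYTES))
    keystream = bytes([0xFF]) * PAYLOAD_BYTES
    encrypted = xor_payload(frame, keystream)
    print(encrypted[:OVERHEAD_COLUMNS] == frame[:OVERHEAD_COLUMNS])
    print(xor_payload(encrypted, keystream) == frame)
```

Because the key stream advances only on payload bytes, transmitter and receiver stay aligned as long as both count the byte positions of the frame in the same way.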
Figure 8 shows the method for overhead bypassing. The SOH processing
component has an interface that outputs the overhead in serial form. The
component supports a receive clock, a transmit clock, and a frame signal
that indicates the overhead positions. The overhead has to be synchronized
between both sides. Two FIFOs are used for this synchronization; they store
the overhead alternately, frame by frame. The FIFOs are used alternately so
that synchronization is guaranteed even in the case of errors, i.e.
overheads interrupted by line drops. Each FIFO is reset after the overhead
of a frame has been read out.
Fig. 8. Overhead-Bypassing



5.2 Implementation of the Encryption 

A DES encryption chip that offers an encryption rate of 160 Mbit/s is used
for the key stream generation. DES is a block cipher with a block length of
64 bits [3]. The alternating mode, which works with the double key length of
128 bits (112 relevant bits) [2], is used. Each of the STM-1 data streams is
encrypted separately, and FIFOs are used as input as well as output
registers. FIFOs are used for several reasons. Encipherment has to be done
synchronously to the data transmission rate of 155 Mbit/s. The generated
key stream has to be buffered, as the encryption does not work at this rate.
In addition, the key stream generation works slowly (in CFB-mode) during
the synchronization phase. For this reason, enough key stream bits have to
be accumulated (in OFB-mode) during the encryption process so that
sufficient key stream bits are available for the subsequent synchronization
phase. One encryption chip is therefore not fast enough. Multiple DES
components are necessary, which alternately read the input blocks from the
input FIFO. Thus, the input FIFO has to be larger than one input block (see
Figure 9).
In the standard case, i.e. outside of synchronization phases, the encryption
works in OFB-mode. A synchronization phase works as follows. The ciphertext
is constantly searched for the predefined bit pattern that triggers
self-synchronization. If this pattern is found, the multiplexer is switched
over, and the encryption now works in CFB-mode. Consequently, ciphertext is
written into the input FIFO until a complete block (64 bits) has
accumulated. In the meantime, the plaintext continues to be XORed with the
key stream bits that have been generated in advance in OFB-mode and
buffered in the active FIFO3 or FIFO4. This continues until the block
generated and encrypted in CFB-mode is available in the FIFO that was not
the last active one (this can be FIFO4 or FIFO3). Then the FIFOs are
switched. The plaintext is now XORed with the key stream that has been
generated in CFB-mode. This process provides self-synchronization. The
FIFO that has been active so far is now reset and prepared for the next
synchronization phase. Please note that the whole process is far more
complex when the layout is designed for synchronous processing. Switching
from OFB-mode to CFB-mode has to be done at the same bit position on the
transmitter side and the receiver side. Signal propagation and delay times
also have to be considered. The fed-back ciphertext bits can, for example,
only be taken from the transmitted ciphertext after a certain delay.
Furthermore, it is necessary to generate an additional key stream block in
OFB-mode and store it in the new FIFO (directly behind the one generated in
CFB-mode) before the output FIFOs can be switched. The bit pattern
recognition required for the next bit pattern check can only be re-enabled
once the FIFO has been sufficiently filled.




Fig. 9. Realization of the Self-Synchronization 
There is an alternative concept for the realization of the statistical
self-synchronization, which uses two separate encryption modules, each of
which contains two DES chips. Using two complete encryption modules has the
benefit that switching between the two modes of operation is less
complicated. One of the two encryption modules works in OFB-mode; the other
one is idle as long as the bit pattern does not appear. When the bit pattern
has been found, the second encryption module is initialized using the
ciphertext in the input shift register. After the generation of one output
block in CFB-mode, this encryption module is switched to OFB-mode and
becomes the working unit. The benefit of this concept lies in the simpler
relationships between the FIFOs and the encryption components. On the other
hand, two complete encryption modules are required, which results in higher
costs per device.

6 Summary and Outlook 

It has been demonstrated that it is possible to use encryption technology
in high-speed networks. Additional channel capacity for synchronization
purposes is not necessary. The line interface has been developed in
cooperation with the Worcester Polytechnic Institute and is available. At
present, the realization of the encryption module is in progress.

In addition to the STM-4 frame used and described in this paper, there
exists a concatenated STM-4 frame, the so-called STM-4c. The STM-4c does
not have a container with four multiplexed STM-1 frames, but consists of
one VC-4c container, which offers a transmission capacity of 600 Mbit/s. We
are currently working on an interface for the STM-4c. The challenge in
realizing the line interface for the STM-4c lies in the fact that no
standard chips are available which support the STM-4c interface. This means
that standard components need to be complemented by programmable logic
components (FPGA) in order to realize an STM-4c interface, e.g. to
calculate the parity byte over the VC-4c. The encryption can be done with
the same modules that will be used in the device with the STM-4 interface.